A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. Essentially, it shows how well the classifier separates the classes. A higher area under the ROC curve (AUC) denotes a better classifier.
In a ROC curve the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100-Specificity, with both expressed as percentages) for different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal).
Example: Consider a test that outputs the probability of having a disease (a disease vs. no-disease classifier).
The diagnostic performance of a test, or the accuracy of a test in discriminating diseased cases from normal cases, is evaluated using Receiver Operating Characteristic (ROC) curve analysis (Metz, 1978; Zweig & Campbell, 1993). ROC curves can also be used to compare the diagnostic performance of two or more laboratory or diagnostic tests (Griner et al., 1981).
When you consider the results of a particular test in two populations, one population with a disease, the other population without the disease, you will rarely observe a perfect separation between the two groups. Indeed, the distribution of the test results will overlap, as shown in the following figure.
For every possible cut-off point or criterion value you select to discriminate between the two populations, there will be some cases with the disease correctly classified as positive (TP = True Positive fraction), but some cases with the disease will be classified negative (FN = False Negative fraction). On the other hand, some cases without the disease will be correctly classified as negative (TN = True Negative fraction), but some cases without the disease will be classified as positive (FP = False Positive fraction).
Schematic outcomes of a test
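The following statistics can be defined, where a is the number of true positives, b the number of false negatives, c the number of false positives, and d the number of true negatives: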
- Sensitivity: probability that a test result will be positive when the disease is present (true positive rate, expressed as a percentage) = a / (a+b)
- Specificity: probability that a test result will be negative when the disease is not present (true negative rate, expressed as a percentage) = d / (c+d)
- Positive likelihood ratio: ratio between the probability of a positive test result given the presence of the disease and the probability of a positive test result given the absence of the disease, i.e. = True positive rate / False positive rate = Sensitivity / (1-Specificity)
- Negative likelihood ratio: ratio between the probability of a negative test result given the presence of the disease and the probability of a negative test result given the absence of the disease, i.e. = False negative rate / True negative rate = (1-Sensitivity) / Specificity
- Positive predictive value: probability that the disease is present when the test is positive (expressed as a percentage) = a / (a+c)
- Negative predictive value: probability that the disease is not present when the test is negative (expressed as a percentage) = d / (b+d)
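The formulas in the list above can be sketched in a few lines of Python. This is only an illustration: the function name `diagnostic_stats` and the counts are made up, and it assumes a, b, c and d are the counts of true positives, false negatives, false positives and true negatives (which is what the sensitivity and specificity formulas imply). It returns fractions rather than percentages and does not guard against zero denominators.

```python
def diagnostic_stats(a, b, c, d):
    """Compute the statistics above from the four counts.

    Assumed layout: a = true positives, b = false negatives,
    c = false positives, d = true negatives.
    """
    sensitivity = a / (a + b)
    specificity = d / (c + d)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_likelihood_ratio": sensitivity / (1 - specificity),
        "negative_likelihood_ratio": (1 - sensitivity) / specificity,
        "positive_predictive_value": a / (a + c),
        "negative_predictive_value": d / (b + d),
    }


# Hypothetical counts, chosen only for illustration.
stats = diagnostic_stats(a=80, b=20, c=10, d=90)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sensitivity is 0.8 and the specificity is 0.9, so the positive likelihood ratio comes out at about 8.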
Precision & Recall
Sensitivity and specificity versus criterion value
When you select a higher criterion value, the false positive fraction will decrease and specificity will increase, but on the other hand the true positive fraction and sensitivity will decrease.
When you select a lower threshold value, then the true positive fraction and sensitivity will increase. On the other hand the false positive fraction will also increase, and therefore the true negative fraction and specificity will decrease.
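The trade-off is easy to see with a small sketch. The test results below are invented for illustration, and the helper `sens_spec` is a hypothetical name; it assumes a case is called positive when its result is at or above the criterion value.

```python
# Hypothetical test results (higher = more suggestive of disease).
diseased = [0.35, 0.55, 0.60, 0.70, 0.80, 0.90]
healthy = [0.10, 0.20, 0.30, 0.40, 0.50, 0.65]


def sens_spec(threshold):
    """Return (sensitivity, specificity) for a given criterion value."""
    true_positives = sum(x >= threshold for x in diseased)
    true_negatives = sum(x < threshold for x in healthy)
    return true_positives / len(diseased), true_negatives / len(healthy)


for threshold in (0.3, 0.5, 0.7):
    sensitivity, specificity = sens_spec(threshold)
    print(f"threshold={threshold}: "
          f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

As the threshold rises from 0.3 to 0.7, sensitivity falls while specificity rises, which is exactly the pattern described above.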
The ROC curve
In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100-Specificity, with both expressed as percentages) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.
Interpreting ROC Curves
Skip the blabber below, directly watch this awesome video
For example, let’s pretend you built a classifier to predict whether a research paper will be admitted to a journal, based on a variety of factors. The features might be the length of the paper, the number of authors, the number of papers those authors have previously submitted to the journal, et cetera. The response (or “output variable”) would be whether or not the paper was admitted.
Let’s first take a look at the bottom portion of this diagram, and ignore everything except the blue and red distributions. We’ll pretend that every blue and red pixel represents a paper for which you want to predict the admission status. This is your validation (or “hold-out”) set, so you know the true admission status of each paper. The 250 red pixels are the papers that were actually admitted, and the 250 blue pixels are the papers that were not admitted.
Since this is your validation set, you want to judge how well your model is doing by comparing your model’s predictions to the true admission statuses of those 500 papers. We’ll assume that you used a classification method such as logistic regression that can not only make a prediction for each paper, but can also output a predicted probability of admission for each paper. These blue and red distributions are one way to visualize how those predicted probabilities compare to the true statuses.
Let’s examine this plot in detail. The x-axis represents your predicted probabilities, and the y-axis represents a count of observations, kind of like a histogram. Let’s estimate that the height at 0.1 is 10 pixels. This plot tells you that there were 10 papers for which you predicted an admission probability of 0.1, and the true status for all 10 papers was negative (meaning not admitted). There were about 50 papers for which you predicted an admittance probability of 0.3, and none of those 50 were admitted. There were about 20 papers for which you predicted a probability of 0.5, and half of those were admitted and the other half were not. There were 50 papers for which you predicted a probability of 0.7, and all of those were admitted. And so on.
Based on this plot, you might say that your classifier is doing quite well, since it did a good job of separating the classes. To actually make your class predictions, you might set your “threshold” at 0.5, and classify everything above 0.5 as admitted and everything below 0.5 as not admitted, which is what most classification methods will do by default. With that threshold, your accuracy rate would be above 90%, which is probably very good.
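Here is a rough sketch of what "setting a threshold" means in code. The labels and probabilities are hypothetical (ten papers rather than 500), and `accuracy_at` is an illustrative helper, not part of any library. It assumes a paper is predicted as admitted when its probability is at or above the threshold.

```python
# Hypothetical true statuses (1 = admitted) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.3, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9]


def accuracy_at(threshold):
    """Fraction of papers classified correctly at the given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(1 for actual, predicted in zip(y_true, y_pred)
                  if actual == predicted)
    return correct / len(y_true)


print(accuracy_at(0.5))  # accuracy at the default threshold
print(accuracy_at(0.4))  # a different threshold gives a different accuracy
```

Note that accuracy is tied to one particular threshold; change the threshold and the number changes with it.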
Now let’s pretend that your classifier didn’t do nearly as well and move the blue distribution. You can see that there is a lot more overlap here, and regardless of where you set your threshold, your classification accuracy will be much lower than before.
Now let’s talk about the ROC curve that you see here in the upper left. So, what is an ROC curve? It is a plot of the True Positive Rate (on the y-axis) versus the False Positive Rate (on the x-axis) for every possible classification threshold. As a reminder, the True Positive Rate answers the question, “When the actual classification is positive (meaning admitted), how often does the classifier predict positive?” The False Positive Rate answers the question, “When the actual classification is negative (meaning not admitted), how often does the classifier incorrectly predict positive?” Both the True Positive Rate and the False Positive Rate range from 0 to 1.
To see how the ROC curve is actually generated, let’s set some example thresholds for classifying a paper as admitted.
A threshold of 0.8 would classify 50 papers as admitted, and 450 papers as not admitted. The True Positive Rate would be the red pixels to the right of the line divided by all red pixels, or 50 divided by 250, which is 0.2. The False Positive Rate would be the blue pixels to the right of the line divided by all blue pixels, or 0 divided by 250, which is 0. Thus, we would plot a point at 0 on the x-axis, and 0.2 on the y-axis, which is right here.
Let’s set a different threshold of 0.5. That would classify 360 papers as admitted, and 140 papers as not admitted. The True Positive Rate would be 235 divided by 250, or 0.94. The False Positive Rate would be 125 divided by 250, or 0.5. Thus, we would plot a point at 0.5 on the x-axis, and 0.94 on the y-axis, which is right here.
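The arithmetic for those two points can be written out as a tiny sketch. The counts come from the example above; the function name `roc_point` and the constant names are just for illustration.

```python
TOTAL_ADMITTED = 250      # red pixels (actual positives)
TOTAL_NOT_ADMITTED = 250  # blue pixels (actual negatives)


def roc_point(true_positives, false_positives):
    """Return the (x, y) = (FPR, TPR) coordinates of one ROC point."""
    tpr = true_positives / TOTAL_ADMITTED
    fpr = false_positives / TOTAL_NOT_ADMITTED
    return fpr, tpr


point_at_08 = roc_point(true_positives=50, false_positives=0)
point_at_05 = roc_point(true_positives=235, false_positives=125)
print(point_at_08)  # (0.0, 0.2)
print(point_at_05)  # (0.5, 0.94)
```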
We’ve plotted two points, but to generate the entire ROC curve, all we have to do is to plot the True Positive Rate versus the False Positive Rate for all possible classification thresholds which range from 0 to 1. That is a huge benefit of using an ROC curve to evaluate a classifier instead of a simpler metric such as misclassification rate, in that an ROC curve visualizes all possible classification thresholds, whereas misclassification rate only represents your error rate for a single threshold. Note that you can’t actually see the thresholds used to generate the ROC curve anywhere on the curve itself.
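To make "all possible thresholds" concrete, here is a minimal sketch that sweeps every distinct predicted probability as a threshold and records one point per threshold. The data is hypothetical and `roc_curve_points` is an illustrative name; the quadratic loop is written for clarity, not speed.

```python
def roc_curve_points(y_true, y_score):
    """Return a list of (fpr, tpr) points, one per distinct threshold.

    y_true holds 1 for positive and 0 for negative observations.
    An observation is predicted positive when its score >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start above every score so that the first point is (0, 0).
    thresholds = [float("inf")] + sorted(set(y_score), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical validation-set labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
points = roc_curve_points(y_true, y_score)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is. In practice a library routine such as scikit-learn's `sklearn.metrics.roc_curve` does the same job more efficiently.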
Now, let’s move the blue distribution back to where it was before. Because the classifier is doing a very good job of separating the blues and the reds, I can set a threshold of 0.6, have a True Positive Rate of 0.8, and still have a False Positive Rate of 0.
Therefore, a classifier that does a very good job separating the classes will have an ROC curve that hugs the upper left corner of the plot. Conversely, a classifier that does a very poor job separating the classes will have an ROC curve that is close to this black diagonal line. That line essentially represents a classifier that does no better than random guessing.
Naturally, you might want to use the ROC curve to quantify the performance of a classifier, and give a higher score for this classifier than this classifier. That is the purpose of AUC, which stands for Area Under the Curve. AUC is literally just the percentage of this box that is under this curve. This classifier has an AUC of around 0.8, a very poor classifier has an AUC of around 0.5, and this classifier has an AUC of close to 1.
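One common way to compute that area is the trapezoidal rule over the ROC points. The sketch below is an illustration under that assumption; `auc_trapezoid` is a made-up name. The `rough_example` curve uses only the two points worked out earlier plus the two corners, so it is a very coarse estimate, since a real curve has many more points.

```python
def auc_trapezoid(points):
    """Area under the piecewise-linear curve through (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


diagonal = [(0.0, 0.0), (1.0, 1.0)]              # random guessing
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # hugs the upper left corner
rough_example = [(0.0, 0.0), (0.0, 0.2), (0.5, 0.94), (1.0, 1.0)]

print(f"{auc_trapezoid(diagonal):.2f}")
print(f"{auc_trapezoid(perfect):.2f}")
print(f"{auc_trapezoid(rough_example):.2f}")
```

The diagonal gives 0.5 and the perfect classifier gives 1, matching the reference values mentioned above.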
Two things about this diagram: First, this diagram shows a case where your classes are perfectly balanced, which is why the sizes of the blue and the red distributions are identical. In most real-world problems, this is not the case. For example, if only 10% of papers were admitted, the blue distribution would be nine times larger than the red distribution. However, that doesn’t change how the ROC curve is generated.
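A quick sketch of why the class balance doesn't matter here: both rates are computed within a single class, so making one class nine times larger leaves them unchanged. The scores are hypothetical, `tpr_fpr` is an illustrative helper, and the sketch assumes the extra negatives have the same score distribution as the original ones (they are simply repeated).

```python
def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / n_pos, fp / n_neg


pos_scores = [0.35, 0.6, 0.7, 0.9]   # hypothetical admitted papers
neg_scores = [0.1, 0.3, 0.4, 0.8]    # hypothetical rejected papers

# Balanced: 4 positives, 4 negatives.
balanced_true = [1] * 4 + [0] * 4
balanced_score = pos_scores + neg_scores

# Imbalanced: the same negatives repeated nine times (10% positives).
imbalanced_true = [1] * 4 + [0] * 36
imbalanced_score = pos_scores + neg_scores * 9

for threshold in (0.2, 0.5, 0.75):
    print(threshold,
          tpr_fpr(balanced_true, balanced_score, threshold),
          tpr_fpr(imbalanced_true, imbalanced_score, threshold))
```

At every threshold the two datasets give the same pair of rates, so they trace the same ROC curve.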
A second note about this diagram is that it shows a case where your predicted probabilities have a very smooth shape, similar to a normal distribution. That was just for demonstration purposes. The probabilities output by your classifier will not necessarily follow any particular shape.
Three other important notes:
The first note is that the ROC curve and AUC are insensitive to whether your predicted probabilities are properly calibrated to actually represent probabilities of class membership. In other words, the ROC curve and the AUC would be identical even if your predicted probabilities ranged from 0.9 to 1 instead of 0 to 1, as long as the ordering of observations by predicted probability remained the same. All the AUC metric cares about is how well your classifier separated the two classes, and thus it is said to only be sensitive to rank ordering. You can think of AUC as representing the probability that a classifier will rank a randomly chosen positive observation higher than a randomly chosen negative observation, and thus it is a useful metric even for datasets with highly unbalanced classes.
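Both claims can be checked with a small sketch. It computes AUC directly from its ranking interpretation, as the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counted as half (an assumed but common convention). The data is hypothetical and `auc_by_ranking` is an illustrative name.

```python
def auc_by_ranking(y_true, y_score):
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half a correct pair
    return correct / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

# Squeeze the scores into the range 0.9 to 1; the ordering is unchanged.
squeezed = [0.9 + 0.1 * s for s in y_score]

auc_original = auc_by_ranking(y_true, y_score)
auc_squeezed = auc_by_ranking(y_true, squeezed)
print(auc_original, auc_squeezed)
```

The squeezed scores are badly calibrated as probabilities, yet the AUC is the same, because only the ordering of the observations matters.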