Demystifying ROC Curves: Understanding Performance Metrics for AI Classification Models

Marco Madrigal
Jul 23, 2024
10 min read

Updated: Aug 1, 2024

Machine learning ROC Curve blog image, generated by AI

Description:

In this blog, we will explore one of the most valuable metrics used for classification models: the ROC curve. We aim to provide you with a clear understanding of this essential metric without going too deeply into complex mathematics. Instead, we'll take a more intuitive approach to explain the concepts, making them accessible to everyone, regardless of their background in data science.

What is the ROC Curve?

In the world of data science, understanding the performance of classification models is a must. The ROC Curve is a powerful tool that exposes the intricacies of model performance and gives clarity for understanding our model behavior.

The name ROC stands for Receiver Operating Characteristic. The term was coined during World War II, when engineers working on radar systems (the receiver) were struggling to detect (the operation) actual aircraft among all the system noise. They developed the ROC as a tool to measure how good the radar was at recovering aircraft from noise. Nowadays, the ROC has become an invaluable metric for measuring the quality of general machine learning classifiers.

Let's imagine that we have a dataset that could be used for a classification problem. It may be multi-class or binary classification. For better understanding, we are going to keep it simple, just binary. For our example, let's say that our dataset is composed of the medical records of a group of patients who have been tested for glaucoma. The dataset records whether each patient has the disease or not.

Graphically, our dataset looks like the next figure. We have 24 patients, where:

10 suffer from the disease, represented as green points.
14 don’t have the disease, represented by the red points.

Patient data population samples for classification models

Now let's assume we have a very simple algorithm, not even a deep learning model, just a simple threshold based on a specific feature from our dataset. Using this threshold we classify the input samples to predict whether they belong to a patient with glaucoma or not. The image below shows a possible prediction result applied to our dataset. The ground truth is our original dataset labels based on the real medical tests.

Classification model threshold compared to ground truth samples

Now let's dig into what each dot color means, as shown in the next table:

Confusion matrix symbols for predictions

True Positive (TP): Patients that have been classified as having the disease and they actually have it.
False Positive (FP): Patients that have been classified as having the disease but they don't have it.
False Negative (FN): Patients that have been classified as not having disease but they actually have it.
True Negative (TN): Patients that have been classified as not having the disease and they actually don’t have it.
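To make the four categories concrete, here is a minimal Python sketch that counts them from two lists of labels. The function name `confusion_counts` and the six example patients are our own illustration (they are not the 24 patients from the figure), and we assume 1 means "has the disease" and 0 means "healthy".

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = has the disease, 0 = healthy)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted sick, actually sick
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted sick, actually healthy
        elif predicted == 0 and actual == 1:
            fn += 1  # predicted healthy, actually sick
        else:
            tn += 1  # predicted healthy, actually healthy
    return tp, fp, fn, tn


# Hypothetical labels for six patients, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

The four counts always add up to the number of samples, which is a handy sanity check.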

The perfect algorithm will only have TP and TN since we expect to predict the true state for every sample, but the world is not that easy 🥲. Different threshold values can yield different classification results. We need to summarize the actual performance of our algorithm so we can compare it with subsequent implementations. For that we need to calculate some values related to the ROC-Curve.

How to calculate the ROC curve

The ROC curve is plotted with True Positive Rate (TPR) against the False Positive Rate (FPR), for different threshold values. TPR is on the y-axis and FPR is on the x-axis, as shown in the figure below:

Let's dig into the mechanics of this process, breaking down the metrics involved. For a given threshold value, we can compute:

True Positive Rate (TPR/Sensitivity/Recall): This metric quantifies the proportion of true positive predictions made by our model relative to all actual positive instances. It is calculated as TPR = TP / (TP + FN).

Specificity: Measures the accuracy of our model in predicting true negative cases. It represents the proportion of true negative predictions relative to all actual negative instances: Specificity = TN / (TN + FP).

False Positive Rate (FPR): This metric complements Specificity and quantifies the proportion of false alarms raised by our model relative to all actual negative instances. It is calculated as FPR = FP / (FP + TN) = 1 - Specificity.

By analyzing Sensitivity/TPR, Specificity, and FPR, we gain a much better understanding of our model's performance across various thresholds. These metrics not only enable us to construct the ROC curve, but also empower us to fine-tune our models for optimal performance in real-world applications.
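The three metrics translate almost directly into code. The sketch below is only an illustration: the function names are ours and the counts are hypothetical, chosen so the results are easy to check by hand.

```python
def true_positive_rate(tp, fn):
    """TPR (sensitivity, recall) = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    return tn / (tn + fp)


def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN), which is the same as 1 - specificity."""
    return fp / (fp + tn)


# Hypothetical counts, chosen only to illustrate the formulas.
tp, fp, fn, tn = 8, 1, 2, 9
tpr = true_positive_rate(tp, fn)
spec = specificity(tn, fp)
fpr = false_positive_rate(fp, tn)
print(f"TPR={tpr:.2f}  Specificity={spec:.2f}  FPR={fpr:.2f}")
```

Note that these helpers divide by the number of actual positives or negatives, so they assume the dataset contains at least one sample of each class.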

Now that we know how to calculate the TPR and FPR, let's continue with the example.

Calculating the TPR and FPR on our dataset!

To calculate the values for FPR and TPR, we need to choose a threshold value. Let's start with 0.5, as shown below:

For this case we count:

TP = 9
FP = 3
FN = 1
TN = 11

Using our formulas, we get the following values: TPR = TP/(TP+FN) = 9/(9+1) = 0.9 and FPR = FP/(FP+TN) = 3/(3+11) ≈ 0.21.

You can choose any number of thresholds and calculate the TPR and FPR for each one. In this case we chose 6: 1.0, 0.75, 0.5, 0.35, 0.15, and 0.

Now let's continue with 0.75. If this were a confidence value from a model, a prediction would need a confidence above 75% to be classified as positive.

Then we calculate the FPR as FP/(FP+TN) = 0/(0+14) = 0 and the TPR as TP/(TP+FN) = 5/(5+5) = 0.5. Repeating this for each threshold value gives us one point of the ROC curve per threshold.

Threshold variation to calculate ROC Curve points
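The same sweep can be sketched in a few lines of Python. The per-patient scores are not given in this post, so the lists below are invented: we picked them so that the counts match our example at thresholds 0.5 (TP = 9, FP = 3) and 0.75 (TP = 5, FP = 0). The points for the other thresholds are therefore hypothetical too. We also assume that a score greater than or equal to the threshold is predicted as positive.

```python
# Hypothetical confidence scores, invented so the counts match the post
# at thresholds 0.5 (TP=9, FP=3) and 0.75 (TP=5, FP=0).
positive_scores = [0.95, 0.90, 0.85, 0.80, 0.78, 0.70, 0.65, 0.60, 0.55, 0.40]
negative_scores = [0.72, 0.62, 0.52, 0.45, 0.42, 0.38, 0.30,
                   0.28, 0.25, 0.20, 0.18, 0.12, 0.10, 0.05]
thresholds = [1.0, 0.75, 0.5, 0.35, 0.15, 0.0]


def roc_point(threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = sum(score >= threshold for score in positive_scores)
    fp = sum(score >= threshold for score in negative_scores)
    fn = len(positive_scores) - tp
    tn = len(negative_scores) - fp
    return fp / (fp + tn), tp / (tp + fn)


roc_points = [roc_point(t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, roc_points):
    print(f"threshold={t:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

With a threshold of 1.0 nothing is predicted positive, which gives the (0, 0) corner, and with a threshold of 0 everything is predicted positive, which gives the (1, 1) corner.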

Now we can plot each point on the ROC-Curve, as shown below:

We have been able to build our ROC curve, but how can we use it to determine the actual performance for our model?

What is the Threshold in a Deep Learning Model?

Some of you may already have an intuitive idea of what the threshold means in a deep learning problem, but let's go a little deeper into this concept.

In classification, a threshold determines the point at which an instance is classified as positive or negative with respect to the confidence value of the model for a specific class. This means that above a certain threshold the output will be classified as positive and below as negative.

As users, we can choose the value of the threshold used with a model; that's how we build our ROC curve. The confidence value at the output, however, is inherent to the model and can only be changed by retraining it. So we play with the threshold in order to find the best filtering for our use case.

The ROC-Curve is a useful tool for visualizing the impact of different thresholds. By plotting true positive rate against false positive rate at various threshold levels, you can see how changing the threshold affects the model's performance.
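As a small illustration, this is what applying a threshold to model confidences looks like in Python. The confidence values and the `classify` helper are hypothetical, and we assume that a confidence equal to the threshold counts as positive.

```python
def classify(confidences, threshold):
    """Turn confidence values into 1 (positive) / 0 (negative) predictions."""
    return [1 if c >= threshold else 0 for c in confidences]


# Hypothetical model outputs: they stay fixed, only the threshold moves.
confidences = [0.92, 0.71, 0.48, 0.15]
predictions = {t: classify(confidences, t) for t in (0.3, 0.5, 0.8)}
for threshold, labels in predictions.items():
    print(f"threshold={threshold}: {labels}")
```

The confidences never change between runs; only the decision we derive from them does.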

Interpretation of AUC-ROC

In the quest to evaluate the performance of our classification models, we've plotted our ROC curve, meticulously calculating True Positive Rate (TPR) and False Positive Rate (FPR). Yet we still need a score that tells us how effective a model is and lets us compare the performance of different models. To pick the best one, we need a single metric that encapsulates overall discriminatory power.

For this we have the Area Under the Curve (AUC) of the ROC curve, represented by the area below the light blue curve in our visualization below. The AUC serves as a comprehensive measure of our model's performance, offering a value that enables direct comparison between different models. The visualization shows ROC curves for models with different performance or, equivalently, different AUC values.

Note: On this visualization the threshold is represented by the yellow point, in this case 50%. Each point of the curve corresponds to a different threshold value.
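One common way to approximate this area from a handful of ROC points is the trapezoidal rule. The sketch below assumes a list of (FPR, TPR) pairs; the six points are hypothetical (only the one at FPR = 3/14, TPR = 0.9 comes from the counts in our example), and with so few thresholds the result is a coarse estimate.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (FPR, TPR) points, via the trapezoidal rule."""
    ordered = sorted(points)  # sort by FPR, then by TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # width times average height
    return area


# Hypothetical ROC points for six thresholds, as (FPR, TPR) pairs.
roc_points = [(0.0, 0.0), (0.0, 0.5), (3 / 14, 0.9),
              (6 / 14, 1.0), (11 / 14, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(roc_points)
print(f"AUC = {auc:.3f}")
```

The diagonal from (0, 0) to (1, 1) gives an area of 0.5 with this function, and a curve that goes straight up to (0, 1) and then across gives 1.0.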

What does the AUC mean, and how can we interpret its values?

At its essence, the AUC represents the probability that our model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. In other words, a higher AUC indicates better discrimination ability, with values approaching 1 signifying near-perfect performance, while values near 0.5 imply random guessing.
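This ranking interpretation can be checked directly by comparing every positive sample against every negative one. The sketch below uses hypothetical scores and our own helper name `pairwise_auc`; counting ties as half a win is a common convention that we assume here.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores: one positive (0.4) is ranked below one negative (0.7).
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
auc = pairwise_auc(positives, negatives)
print(f"AUC = {auc:.3f}")
```

Eight of the nine pairs are ranked correctly here, so the AUC is 8/9. Comparing all pairs is quadratic in the number of samples, so this is meant for building intuition rather than for large datasets.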

So, how do we interpret the AUC values in practice?

Let's break it down:

AUC equal to 0.5: At this value, our model demonstrates no discriminatory ability; it is essentially random guessing. It fails to distinguish between positive and negative instances.

AUC between 0.5 and 0.6: Here, our model's performance is lacking, exhibiting weak discrimination and offering little advantage over random guessing.

AUC between 0.6 and 0.75: In this range, our model demonstrates moderate performance, showing some discriminatory ability but with room for improvement.

AUC between 0.75 and 0.9: Now we're talking! Models falling within this range exhibit good performance, effectively distinguishing between positive and negative instances with notable accuracy.

AUC between 0.9 and 0.97: Models in this range are in a league of their own, boasting exceptional discriminatory power and delivering reliable predictions with remarkable precision.

AUC between 0.97 and 1.0: Behold the pinnacle of excellence! Models with AUC values nearing 1 demonstrate near-perfect performance.

In essence, the AUC serves as a score, guiding us to the most effective models and enabling informed decision-making in our data-driven endeavors. Below we can see what the AUC looks like visually.

Adjusting the Threshold

Now we have our plots and the best model for our application, but our journey has not ended here. From previous sections we know that we cannot change the confidence our model assigns to each output, but we can choose the threshold at which a confidence value is classified as positive or negative.

You need to be really careful with this and we are going to give you some ideas on how to choose this threshold. As you might have guessed, the ROC curve is a very useful tool to select the ideal one for your application.

First you need to know that adjusting the threshold is a trade-off between sensitivity and specificity, explained in previous sections.

The choice of threshold can significantly impact the performance of your model, depending on the specific requirements of your application.

Lowering the Threshold:
- Increases the number of instances classified as positive.
- Can increase sensitivity (true positive rate) but may also increase false positives.
Raising the Threshold:
- Decreases the number of instances classified as positive.
- Can increase specificity (true negative rate) but may also increase false negatives.
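The trade-off is easy to see with a few numbers. In this sketch the scores are hypothetical and the helper `sensitivity_specificity` is ours; as before, we assume a score at or above the threshold is predicted positive.

```python
# Hypothetical confidence scores for four sick and four healthy patients.
positive_scores = [0.9, 0.7, 0.6, 0.4]
negative_scores = [0.8, 0.5, 0.3, 0.1]


def sensitivity_specificity(threshold):
    """Return (sensitivity, specificity) at the given threshold."""
    tp = sum(s >= threshold for s in positive_scores)  # sick, flagged
    tn = sum(s < threshold for s in negative_scores)   # healthy, not flagged
    return tp / len(positive_scores), tn / len(negative_scores)


results = {t: sensitivity_specificity(t) for t in (0.75, 0.45, 0.2)}
for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

As the threshold goes down from 0.75 to 0.2, sensitivity climbs from 0.25 to 1.0 while specificity drops from 0.75 to 0.25.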

Before diving into the technical aspects, it’s important to understand the context and requirements of your application. Consider the following questions:

What is the cost of false positives? (e.g., incorrectly diagnosing a healthy patient with a disease)
What is the cost of false negatives? (e.g., failing to diagnose a patient who actually has the disease)
Is sensitivity or specificity more important for your application? (e.g., in medical diagnostics, sensitivity might be prioritized to ensure no cases are missed)

But this doesn't tell you how to calculate the threshold value, so let's look at 3 methods that may help you:

Method 1: Youden’s J Statistic

Youden’s J statistic is a useful metric for evaluating the effectiveness of a diagnostic test or a classification model. It is particularly valuable for determining the optimal threshold for such models. It is defined as J = Sensitivity + Specificity - 1, which is the same as TPR - FPR. For each threshold in the ROC curve, compute the J statistic. The optimal threshold is the one that maximizes J.
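A minimal sketch of this search, assuming we already have a table of (threshold, FPR, TPR) rows. The table is hypothetical, apart from the 0.5 and 0.75 rows, which follow the counts from our example.

```python
# (threshold, FPR, TPR) rows; hypothetical except for the 0.75 and 0.5 rows.
roc_table = [
    (1.00, 0 / 14, 0.0),
    (0.75, 0 / 14, 0.5),
    (0.50, 3 / 14, 0.9),
    (0.35, 6 / 14, 1.0),
    (0.15, 11 / 14, 1.0),
    (0.00, 14 / 14, 1.0),
]


def youden_j(fpr, tpr):
    """J = sensitivity + specificity - 1 = TPR - FPR."""
    return tpr - fpr


# Pick the row whose J is largest.
best_threshold, best_fpr, best_tpr = max(
    roc_table, key=lambda row: youden_j(row[1], row[2])
)
best_j = youden_j(best_fpr, best_tpr)
print(f"best threshold={best_threshold}  J={best_j:.3f}")
```

Youden's J weighs sensitivity and specificity equally, so it fits best when false positives and false negatives are about equally costly.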

Method 2: Cost-Based Analysis

If the costs of false positives and false negatives are known, calculate the total cost for each threshold and choose the one that minimizes the overall cost. This involves assigning a cost value to each false positive (FP) and each false negative (FN) and calculating a cost function, for example Cost = cost_FP × FP + cost_FN × FN.
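As a sketch, we can assume a simple linear cost, where each false positive and each false negative has a fixed price. The cost values and the (threshold, FP, FN) table below are hypothetical, apart from the 0.5 and 0.75 rows, which follow the counts from our example.

```python
# (threshold, FP, FN) rows; hypothetical except for the 0.75 and 0.5 rows.
counts = [
    (1.00, 0, 10),
    (0.75, 0, 5),
    (0.50, 3, 1),
    (0.35, 6, 0),
    (0.15, 11, 0),
    (0.00, 14, 0),
]


def total_cost(fp, fn, cost_fp, cost_fn):
    """Assumed linear cost: a fixed price per false positive and per false negative."""
    return cost_fp * fp + cost_fn * fn


def cheapest_threshold(cost_fp, cost_fn):
    """Return the threshold whose row has the lowest total cost."""
    best = min(counts, key=lambda row: total_cost(row[1], row[2], cost_fp, cost_fn))
    return best[0]


# A missed diagnosis costs 5x a false alarm, then the other way around.
print(cheapest_threshold(cost_fp=1, cost_fn=5))
print(cheapest_threshold(cost_fp=5, cost_fn=1))
```

With expensive false negatives the cheapest threshold is the low one (0.35), and with expensive false positives it moves up to 0.75, which is the behaviour we would expect.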

Method 3: Equal Error Rate (EER)

The Equal Error Rate is the point where the false positive rate equals the false negative rate, that is, where FPR = FNR = 1 - TPR. This method is useful if your application requires a balance between FPR and FNR.
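With only a few thresholds, FPR and FNR will rarely be exactly equal, so a practical sketch is to pick the threshold where they are closest. The table is hypothetical, apart from the 0.5 and 0.75 rows, which follow the counts from our example.

```python
# (threshold, FPR, TPR) rows; hypothetical except for the 0.75 and 0.5 rows.
roc_table = [
    (1.00, 0 / 14, 0.0),
    (0.75, 0 / 14, 0.5),
    (0.50, 3 / 14, 0.9),
    (0.35, 6 / 14, 1.0),
    (0.15, 11 / 14, 1.0),
    (0.00, 14 / 14, 1.0),
]


def closest_to_equal_error(table):
    """Return the row where FPR and FNR (= 1 - TPR) are closest to each other."""
    return min(table, key=lambda row: abs(row[1] - (1.0 - row[2])))


eer_threshold, eer_fpr, eer_tpr = closest_to_equal_error(roc_table)
eer_fnr = 1.0 - eer_tpr
print(f"threshold={eer_threshold}  FPR={eer_fpr:.3f}  FNR={eer_fnr:.3f}")
```

Evaluating more thresholds narrows the gap between the two error rates at the selected point.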

But be careful: just because there are methods to calculate a threshold doesn’t mean the result is suitable for your application. For example, in the medical industry the threshold may need to be really high, depending on the type of classification you want to do, because the model needs to be pretty sure about its decision when the medication could otherwise be harmful for the patient.

Validate the chosen threshold

Get metrics from your test dataset (don’t mix it with validation or training dataset!) and ensure that the chosen threshold generalizes well to unseen data and meets the application’s requirements.
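A minimal sketch of such a check is shown below. Everything in it is hypothetical: the held-out labels and scores, the threshold of 0.5 (assumed to have been chosen earlier on validation data), and the minimum sensitivity of 0.7 standing in for an application requirement.

```python
def metrics_at_threshold(y_true, scores, threshold):
    """Sensitivity and specificity at a fixed threshold (1 = positive label)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


# Hypothetical held-out test set, never used for training or threshold selection.
test_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_scores = [0.91, 0.74, 0.58, 0.33, 0.66, 0.41, 0.22, 0.08]

chosen_threshold = 0.5  # assumed to come from the validation step
min_sensitivity = 0.7   # hypothetical application requirement

test_metrics = metrics_at_threshold(test_labels, test_scores, chosen_threshold)
meets_requirement = test_metrics["sensitivity"] >= min_sensitivity
print(test_metrics, "meets requirement:", meets_requirement)
```

If the test metrics fall well below what we saw on the validation data, that is a sign the threshold does not generalize and should be revisited.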

A little more about the ROC-Curve!

To further understand the behavior of the ROC Curve, let's consider its response to varying distributions of TP and TN. Visualizing these changes helps us better understand the dynamics between positive and negative cases.

The red curve tells us the probability of the model correctly classifying a positive result, while the green one does the same for a negative result. When there is no intersection between the two distributions, a threshold can be chosen such that it perfectly classifies the samples. This is a perfect scenario!

On the other hand, when the green distribution starts overlapping the red one, the area below the curve is the sum of True Positives, False Positives and False Negatives. As a consequence, the ROC curve on the right starts flattening and, hence, the Area Under the Curve (AUC) gets smaller.

In this example the yellow line, and the corresponding yellow dot, mark the chosen threshold. Probabilities below this value are treated as negative cases and probabilities above it as positive cases.

Note: In this context, positive means that the classification is true and negative means that it is false. In other words, if you are classifying dogs, these values mean "is a dog" or "is not a dog". In a multiclass problem it is the same for each class.

Variation on ROC Curve and AUC based on TPR and FPR

In summary, the ROC Curve serves as a guide to better understand our classification models. It allows us to compare our models deterministically and to choose the optimal threshold for our application. Ultimately, this leads us to better performance. Thanks to its probabilistic nature, it gives us much better confidence in the reliability of our models' performance on real-world data.

Note: If you want a better understanding, you can use the following visualization tool: Interactive ROC-Curve (created by Kevin Markham).

Conclusion

The ROC-Curve is a powerful tool to evaluate our model performance, choose our optimal threshold, and be sure that we are not biasing our analysis. It provides us with a better grasp of our dataset and a much better understanding of our model's behavior. Nowadays, understanding the AUC-ROC significance and utility is crucial for data scientists and machine learning practitioners.

In this blog we have explored its definition and uses. Moreover, we saw how it is applied to a real-world problem, calculating the ROC-Curve step by step. Finally, we gained a better understanding of AUC values and how they are calculated and interpreted as a score that reveals the performance of our model in the real world.

References

Article: La Curva ROC - Elsevier
Article: Understanding AUC-ROC Curve - Towards Data Science
Article: Understanding the ROC Curve in Three Visual Steps - Towards Data Science
Article: ROC Curves and AUC Explained - Data School
Forum: What is Sensitivity in Confusion Matrix? - Data Science Stack Exchange
Wikipedia: Receiver Operating Characteristic - Wikipedia