
How to explain the ROC curve and ROC AUC score?


The ROC AUC score is a popular metric to evaluate the performance of binary classifiers. To compute it, you must measure the area under the ROC curve, which shows the classifier's performance at varying decision thresholds. 

This chapter explains how to plot the ROC curve, compute the ROC AUC and interpret it. We will also showcase it using the open-source Evidently Python library.

TL;DR

  • The ROC curve shows the performance of a binary classifier with different decision thresholds. It plots the True Positive rate (TPR) against the False Positive rate (FPR).
  • The ROC AUC score is the area under the ROC curve. It sums up how well a model can produce relative scores to discriminate between positive and negative instances across all classification thresholds. 
  • The ROC AUC score ranges from 0 to 1, where 0.5 indicates random guessing, and 1 indicates perfect performance.
Evidently Classification Performance Report
Start with AI observability

Want to keep tabs on your classification models? Try Evidently Cloud, a collaborative AI observability platform, powered by the open-source Evidently library with 20m+ downloads.

Start free ⟶ Or try open source ⟶

What is a ROC curve?

The ROC curve stands for the Receiver Operating Characteristic curve. It is a graphical representation of the performance of a binary classifier at different classification thresholds. 

The curve plots the possible True Positive rates (TPR) against the False Positive rates (FPR).

Here is how the curve can look:

ROC curve chart

Each point on the curve represents a specific decision threshold with a corresponding True Positive rate and False Positive rate.

What is a ROC AUC score?

ROC AUC stands for Receiver Operating Characteristic Area Under the Curve. 

ROC AUC score is a single number that summarizes the classifier's performance across all possible classification thresholds. To get the score, you must measure the area under the ROC curve.

ROC AUC score

ROC AUC score shows how well the classifier distinguishes positive and negative classes. It can take values from 0 to 1.

A higher ROC AUC indicates better performance. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.

To understand the ROC AUC metric, it helps to understand the ROC curve first. 

How does the ROC curve work?

Let’s explain it step by step! We will cover:

  • What TPR and FPR are, and how to calculate them
  • What a classification threshold is
  • How to plot a ROC curve

True vs. False Positive rates

The ROC curve plots the True Positive rate (TPR) against the False Positive rate (FPR) at various classification thresholds. You can derive TPR and FPR from a confusion matrix. 

A confusion matrix summarizes all correct and false predictions generated for a specific dataset. Here is an example of a matrix generated for a spam prediction use case:

Confusion matrix example
Source: example matrix from Confusion Matrix chapter.

You can calculate the True Positive and False Positive rates directly from the matrix.

True positive rate and false positive rate

TPR (True Positive rate, also known as recall) shows the share of detected true positives. For example, the share of emails correctly labeled as spam out of all spam emails in the dataset.

To compute the TPR, you must divide the number of True Positives by the total number of objects of the target class – both identified (True Positives) and missed (False Negatives). 

Recall metric formula
In the example confusion matrix above, TPR = 600 / (600 + 300) = 0.67. The model successfully detected 67% of all spam emails.

FPR (False Positive rate) shows the share of objects falsely assigned a positive class out of all objects of the negative class. For example, the proportion of legitimate emails falsely labeled as spam.

You can calculate the FPR by dividing the number of False Positives by the total number of objects of the negative class in the dataset.

You can think of the FPR as a "false alarm rate."

False positive rate formula
In our example, FPR = 100 / (100 + 9000) = 0.01. The model falsely flagged 1% of legitimate emails as spam.
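
As a quick sanity check, here is a minimal sketch that reproduces both calculations in Python from the confusion matrix counts used in the example above.

```python
# Reproduce the TPR and FPR calculations from the example confusion matrix.
tp, fn = 600, 300   # spam emails: correctly detected vs. missed
fp, tn = 100, 9000  # legitimate emails: falsely flagged vs. correctly passed

tpr = tp / (tp + fn)  # recall: share of all spam the model detected
fpr = fp / (fp + tn)  # "false alarm rate": share of legitimate emails flagged as spam

print(f"TPR (recall): {tpr:.2f}")  # 0.67
print(f"FPR: {fpr:.2f}")           # 0.01
```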

To create the ROC curve, you need to plot the TPR values against the FPR values at different decision thresholds. 

Classification threshold 

You might ask, what do "different" TPR and FPR values mean? Did we not just calculate them once and for all? 

In fact, we calculated the values for a given confusion matrix at a given decision threshold. But for a probabilistic classification model, these TPR and FPR values are not set in stone. 

You can vary the decision threshold that defines how to convert the model predictions into labels. This, in turn, can change the number of errors the model makes. 

Classification decision threshold

A probabilistic classification model returns a number from 0 to 1 for each object. For example, for each email, it predicts how likely this email is spam. For a given email, it can be 0.1, 0.55, 0.99, or any other number. 

You then have to decide at which probability you convert this prediction to a label. For instance, you can label all emails with a predicted probability of over 0.5 as spam. Or, you can only apply this decision when the score is 0.8 or higher. 

This choice is what sets the classification threshold. 

To better understand the impact of the decision threshold, explore the Classification Threshold chapter in the guide.

As you change the threshold, you will usually get new combinations of errors of different types (and new confusion matrices)!

Confusion matrices with different classification thresholds

When you set the threshold higher, you make the model "more conservative." It assigns the True label when it is "more confident." But as a consequence, you typically lower recall: you detect fewer examples of the target class overall.

When you set the threshold lower, you make the model "less strict." It assigns the True label more often, even when "less confident." Consequently, you increase recall: you will detect more examples of the target class. However, this may also lead to lower precision, as the model may make more False Positive predictions. 

TPR and FPR change in the same direction. The higher the recall (TPR), the higher the rate of false positive errors (FPR). The lower the recall, the fewer false alarms the model gives.

In the example above, the recall (TPR) decreases as we set the decision threshold higher:

  • 0.5 threshold: 800 / (800 + 100) = 0.89
  • 0.8 threshold: 600 / (600 + 300) = 0.67
  • 0.95 threshold: 200 / (200 + 700) = 0.22

The FPR also goes down:

  • 0.5 threshold: 500 / (500 + 8600) = 0.06
  • 0.8 threshold: 100 / (100 + 9000) = 0.01
  • 0.95 threshold: 10 / (10 + 9090) = 0.001
Confusion matrices with different classification thresholds
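
To see where numbers like these come from in code, here is a minimal sketch that applies several thresholds to predicted probabilities and recomputes TPR and FPR each time. The labels and scores below are synthetic stand-ins, not the data behind the tables above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)

# Synthetic stand-in data: 1 = spam, 0 = legitimate.
y_true = rng.integers(0, 2, size=1000)
# Noisy scores that are higher, on average, for the positive class.
y_score = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=1000), 0, 1)

for threshold in [0.5, 0.8, 0.95]:
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.3f}")
```

As the threshold rises, both TPR and FPR shrink, mirroring the pattern in the lists above.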

Plotting the ROC curve

Now, let’s get back to the curve!

The ROC curve illustrates this trade-off between the TPR and FPR we just explored. Unless your model is near-perfect, you have to balance the two. As you try to increase the TPR (i.e., correctly identify more positive cases), the FPR may also increase (i.e., you get more false alarms). 

For example, the more spam you want to detect, the more legitimate emails you falsely flag as suspicious. 

The ROC curve is a visual representation of this choice. Each point on the curve corresponds to a combination of TPR and FPR values at a specific decision threshold. 

To create the curve, you should plot the FPR values on the x-axis and the TPR values on the y-axis.

If we continue with the example above, here is how it can look.

Plotting the ROC curve

Since our imaginary model does fairly well, most values are "crowded" to the left. 

The left side of the curve corresponds to the more "confident" thresholds: a higher threshold leads to lower recall and fewer false positive errors. The extreme point is when both recall and FPR are 0. In this case, there are no correct detections but also no false ones. 

The right side of the curve represents the "less strict" scenarios when the threshold is low. Both recall and the False Positive rate are higher, ultimately reaching 100%. If you put the threshold at 0, the model will always predict the positive class: both recall and the FPR will be 1.

When you increase the threshold, you move left on the curve. If you decrease the threshold, you move to the right.

ROC curve and decision threshold
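
Here is a minimal sketch of how such a plot can be drawn by hand from the (FPR, TPR) pairs computed earlier for the 0.95, 0.8, and 0.5 thresholds, plus the two extreme points (0, 0) and (1, 1). It only assumes matplotlib is installed.

```python
import matplotlib.pyplot as plt

# (FPR, TPR) pairs from the 0.95, 0.8, and 0.5 thresholds in the example,
# plus the two extreme points of the curve.
fpr = [0.0, 0.001, 0.01, 0.06, 1.0]
tpr = [0.0, 0.22, 0.67, 0.89, 1.0]

plt.plot(fpr, tpr, marker="o", label="example model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random model")
plt.xlabel("False Positive rate")
plt.ylabel("True Positive rate")
plt.title("ROC curve")
plt.legend()
plt.show()
```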

Now, let’s take a look at the perfect scenario.

If our model is correct in all the predictions, all the time, it means that the TPR is always 1.0, and FPR is 0. It finds all the cases and never gives false alarms. 

Here is how the ROC curve would look.

Perfect ROC curve

Now, let’s look at the worst-case scenario. 

Let’s say our model is random. In other words, it cannot distinguish between the two classes, and its predictions are no better than chance.

A genuinely random model will predict the positive and negative classes with equal probability. 

ROC curve for a random model

The ROC curve, in this case, will look like a diagonal line connecting points (0,0) and (1,1). For a random classifier, the TPR is equal to the FPR at every threshold: the model flags the same share of positive and negative instances, so the detection rate and the false alarm rate always match. As the classification threshold changes, the TPR goes up or down in the same proportion as the FPR.
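
You can check this numerically: if the scores carry no information about the labels, the ROC AUC lands close to 0.5. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)  # random labels
y_score = rng.random(10_000)              # scores unrelated to the labels

print(roc_auc_score(y_true, y_score))  # close to 0.5
```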

Most real-world models will fall somewhere between the two extremes. The better the model can distinguish between positive and negative classes, the closer the curve is to the top left corner of the graph.

ROC curve for a real-world model

A ROC curve is a two-dimensional reflection of classifier performance across different thresholds. It is convenient to get a single metric to summarize it. 

This is what the ROC AUC score does.

How to get the ROC AUC score?

A ROC AUC score is a single metric to summarize the performance of a classifier across different thresholds. To compute the score, you must measure the area under the ROC curve.

ROC AUC score

There are different methods to calculate the ROC AUC score, but a common one is the trapezoidal rule. It approximates the area under the ROC curve by splitting it into trapezoids between consecutive (FPR, TPR) points and summing their areas.

You can compute ROC AUC in Python using sklearn.
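
For example, here is a minimal sketch on synthetic data: `roc_curve` returns the (FPR, TPR) points, `sklearn.metrics.auc` applies the trapezoidal rule to them, and `roc_auc_score` computes the same value directly from labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

rng = np.random.default_rng(7)

# Synthetic labels and scores: positives get higher scores on average.
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)

print("Trapezoidal rule on the curve:", auc(fpr, tpr))
print("roc_auc_score on the raw data:", roc_auc_score(y_true, y_score))
# Both lines print the same value.
```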

If we return to our extreme "perfect" and "random" example, computing the ROC AUC score is easy. In the perfect scenario, we measure the square area: ROC AUC is 1. In the random scenario, it is precisely half: ROC AUC is 0.5.

ROC AUC for perfect and random models

What is a good ROC AUC?

The ROC AUC score can range from 0 to 1. A score of 0.5 indicates random guessing, and a score of 1 indicates perfect performance.

A score slightly above 0.5 shows that a model has at least some (albeit small) predictive power. This is generally inadequate for real applications.

As a rule of thumb, a ROC AUC score above 0.8 is considered good, while a score above 0.9 is considered great. 

However, the usefulness of the model depends on the specific problem and use case. There is no standard. You should interpret the ROC AUC score in context, together with other classification quality metrics, such as accuracy, precision, or recall.

How to explain ROC AUC?

The intuition behind ROC AUC is that it measures how well a binary classifier can distinguish or separate between the positive and negative classes.

It reflects the probability that the model will correctly rank a randomly chosen positive instance higher than a random negative one.

For example, this is how the model predictions might look, arranged by the predicted output scores.

Model output score

ROC AUC reflects the likelihood that a random positive (red) instance will be located to the right of a random negative (gray) instance. 

It shows how well a model can produce good relative scores and generally assign higher probabilities to positive instances over negative ones. 

In the above picture, the classifier is not perfect but "directionally correct." It ranks most negative instances lower than positive ones.  

The ideal situation is to have all positive instances ranked higher than all negative instances, resulting in an AUC of 1.0.

Model output score
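
You can verify this interpretation directly. Here is a minimal sketch on synthetic data: estimate the share of (positive, negative) pairs where the positive instance gets the higher score, and compare it to `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=500), 0, 1)

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Share of (positive, negative) pairs where the positive is ranked higher;
# ties count as half, matching the usual ROC AUC convention.
diffs = pos[:, None] - neg[None, :]
pairwise_estimate = (diffs > 0).mean() + 0.5 * (diffs == 0).mean()

print("Pairwise estimate:", pairwise_estimate)
print("roc_auc_score:    ", roc_auc_score(y_true, y_score))
# The two numbers match.
```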

It’s worth noting that even a perfect ROC AUC does not mean the predictions are well-calibrated. A well-calibrated classifier produces predicted probabilities that reflect the actual probabilities of the events. Say, if it predicts that an event has a 70% chance of occurring, it should be correct about 70% of the time. ROC AUC is not a calibration measure. 

ROC AUC score, instead, shows how well a model can produce relative scores that help discriminate between positive or negative instances.
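
A minimal sketch of this distinction: squashing all scores into a narrow range makes them useless as probability estimates, yet leaves the ROC AUC untouched, because only the ranking of the scores matters.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

# A strictly monotonic transformation preserves the ranking, and therefore the AUC,
# even though the new values no longer read as calibrated probabilities.
squashed = 0.9 + y_score / 10

print(roc_auc_score(y_true, y_score))   # 0.9375
print(roc_auc_score(y_true, squashed))  # same value
```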

ROC AUC pros and cons 

Let’s sum up the important properties of the metric.

Here are some advantages of the ROC AUC score. 

  • A single number. ROC AUC reflects the model quality in one number. It is convenient to use a single metric, especially when comparing multiple models. 
  • Does not change with the classification threshold. Unlike precision and recall, ROC AUC stays the same. In fact, it sums up the performance across the different classification thresholds. It is a valuable "overall" quality measure, whereas precision and recall provide a quality "snapshot" at a given decision threshold.
  • It is a suitable evaluation metric for imbalanced data. ROC AUC measures the model's ability to discriminate between the positive and negative classes, regardless of their relative frequencies in the dataset.
  • More tolerant to the drift in class balance. The ROC AUC generally remains more stable if the distribution of classes changes. This often happens in production use, for example, when fraud rates vary month-by-month. If they change significantly, the earlier chosen decision threshold might become inadequate. For example, if fraud becomes more prevalent, the recall of the fraud detection model might drop, as this metric uses the absolute number of actual fraud cases in the denominator. However, ROC AUC might remain stable, indicating that the model can still differentiate between the two classes despite the changes in their relative frequencies.
  • Scale-invariant. ROC AUC measures how well predictions are ranked rather than their absolute values. This way, it helps compare the quality of models that might output "different ranges" of predicted probabilities. It is typically relevant when you experiment with different models during the model training stage.

The metric also has a few downsides. As usual, a lot depends on the context!

  • ROC AUC is not intuitive. This metric can be hard to explain to business stakeholders and does not have an immediately interpretable meaning.
  • It does not consider the cost of errors. ROC AUC does not account for different types of errors and their consequences. In many scenarios, false negatives can be more costly than false positives, or vice versa. In this case, working to balance precision and recall and setting the appropriate classification threshold to minimize a certain type of error is often a more suitable approach. ROC AUC is not useful for this type of optimization.
  • It can be misleading if the class imbalance is severe. When the positive class is very small, ROC AUC can give a false impression of high quality. Because the FPR denominator contains all the negatives, even a large absolute number of false positives can translate into a tiny FPR. The ROC curve then appears to "hug" the top left corner of the plot, and the score looks much better than random, even though the model may miss many positives and its positive predictions may be dominated by false alarms. In this case, it may be more appropriate to look at the precision-recall curve and rely on metrics like precision, recall, or F1-score to evaluate ML model quality (see the sketch after this list).
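
Here is a minimal sketch of that pitfall on synthetic, heavily imbalanced data (about 1% positives): the ROC AUC looks strong, while precision at a fixed threshold and the precision-recall summary (average precision) tell a much less flattering story.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_score, recall_score)

rng = np.random.default_rng(1)

n = 50_000
y_true = (rng.random(n) < 0.01).astype(int)  # ~1% positive class
# Scores that separate the classes only moderately well.
y_score = np.clip(0.25 * y_true + rng.normal(0.4, 0.15, size=n), 0, 1)

y_pred = (y_score >= 0.6).astype(int)

print("ROC AUC:          ", roc_auc_score(y_true, y_score))       # looks strong
print("Average precision:", average_precision_score(y_true, y_score))
print("Precision @ 0.6:  ", precision_score(y_true, y_pred))      # far less impressive
print("Recall @ 0.6:     ", recall_score(y_true, y_pred))
```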
Want to see an example of using ROC AUC? We prepared a tutorial on the employee churn prediction problem, "What is your model hiding?". You will train two classification models with similar ROC AUC scores and explore how to compare them. 

When to use ROC AUC 

Considering all the above, ROC AUC is useful, but as usual, not a perfect metric. 

  • During model training, it helps compare multiple ML models against each other.
  • ROC AUC is particularly useful when the goal is to rank predictions in order of their confidence level rather than produce well-calibrated probability estimates.
  • Both in training and production evaluation, ROC AUC helps provide a more complete picture of the model performance, giving a single metric that sums up the quality across different thresholds. 

However, there are limitations:

  • ROC AUC is less useful when you care about different costs of error and want to find the optimal threshold to optimize for the cost of a specific error.  
  • It can be misleading when the data is heavily imbalanced (which often coincides with the cases where you ultimately care about different costs of errors).

ROC AUC in ML monitoring 

You can use ROC AUC during production model monitoring as long as you have the true labels to compute it. 

However, a high ROC AUC score does not communicate all relevant aspects of the model quality. The score evaluates the degree of separability and does not consider the asymmetric costs of false positives and negatives. It captures, in one number, the quality of the model across all possible thresholds.

In many real-world scenarios, this overall performance is not relevant: you need to consider the costs of error and define a specific threshold to make automated decisions. Therefore, the ROC AUC score should be used with other metrics, such as precision and recall. You might also want to monitor precision and recall for specific important segments in your data (such as users in specific locations, premium users, etc.) to capture differences in performance.

However, having ROC AUC as an additional metric might still be informative. For example, in cases where the shifting balance of classes might negatively impact recall, tracking ROC AUC might communicate whether the model itself remains reasonable.

ROC curve in Python
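The snippet below is a minimal end-to-end sketch using scikit-learn and matplotlib on synthetic data: train a simple classifier, get the predicted probabilities, compute the (FPR, TPR) points with `roc_curve`, and plot them alongside the ROC AUC score.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary classification data and a simple model.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc_value = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"ROC AUC = {auc_value:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random model")
plt.xlabel("False Positive rate")
plt.ylabel("True Positive rate")
plt.legend()
plt.show()
```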

ROC curve in Evidently

To quickly calculate and visualize the ROC curve and ROC AUC score, as well as other metrics and plots to evaluate the quality of a classification model, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production. 

You will need to prepare a dataset that includes the predicted scores (or labels) and the true labels and pass it to the tool. You will instantly get an interactive report that includes ROC AUC, accuracy, precision, recall, and F1-score, as well as other visualizations. You can also integrate these model quality checks into your production pipelines. 
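
Here is a rough sketch of how this can look in code. It assumes the `Report` / `ClassificationPreset` API from recent Evidently releases and a DataFrame with `target` and `prediction` columns; check the Evidently documentation for the exact API of the version you use.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# Train a simple model on synthetic data to have something to evaluate.
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A DataFrame with the true labels and the predicted probability of the positive class.
df = pd.DataFrame({
    "target": y_test,
    "prediction": model.predict_proba(X_test)[:, 1],
})

report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=None, current_data=df,
           column_mapping=ColumnMapping(target="target", prediction="prediction"))
report.save_html("classification_report.html")  # includes the ROC curve and ROC AUC
```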

Try Evidently Cloud
Don’t want to deal with deploying and maintaining the ML monitoring service? Sign up for Evidently Cloud, a SaaS ML observability platform built on top of Evidently open-source.

Start free ⟶

Read next

Accuracy, Precision, Recall

Accuracy reflects the overall correctness of the model. Precision shows how many of the model's positive predictions are correct. Recall shows the share of the positive class detected by the model. This chapter explains how to choose an appropriate metric considering the use case and the costs of errors.

Classification Threshold

In probabilistic machine learning problems, the model output is not a label but a score. You must then set a decision threshold to assign a specific label to a prediction. This chapter explains how to choose an optimal classification threshold to balance precision and recall.

