The ROC AUC score is a popular metric to evaluate the performance of binary classifiers. To compute it, you must measure the area under the ROC curve, which shows the classifier's performance at varying decision thresholds.
This chapter explains how to plot the ROC curve, compute the ROC AUC and interpret it. We will also showcase it using the open-source Evidently Python library.
Want to keep tabs on your classification models? Try Evidently Cloud, a collaborative AI observability platform, powered by the open-source Evidently library with 20m+ downloads.
The ROC curve stands for the Receiver Operating Characteristic curve. It is a graphical representation of the performance of a binary classifier at different classification thresholds.
The curve plots the possible True Positive rates (TPR) against the False Positive rates (FPR).
Here is how the curve can look:
Each point on the curve represents a specific decision threshold with a corresponding True Positive rate and False Positive rate.
ROC AUC stands for Receiver Operating Characteristic Area Under the Curve.
ROC AUC score is a single number that summarizes the classifier's performance across all possible classification thresholds. To get the score, you must measure the area under the ROC curve.
ROC AUC score shows how well the classifier distinguishes positive and negative classes. It can take values from 0 to 1.
A higher ROC AUC indicates better performance. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.
To understand the ROC AUC metric, it helps to understand the ROC curve first.
Let’s explain it step by step!
The ROC curve plots the True Positive rate (TPR) against the False Positive rate (FPR) at various classification thresholds. You can derive TPR and FPR from a confusion matrix.
A confusion matrix summarizes all correct and false predictions generated for a specific dataset. Here is an example of a matrix generated for a spam prediction use case:
You can calculate the True Positive and False Positive rates directly from the matrix.
TPR (True Positive rate, also known as recall) shows the share of detected true positives. For example, the share of emails correctly labeled as spam out of all spam emails in the dataset.
To compute the TPR, you must divide the number of True Positives by the total number of objects of the target class – both identified (True Positives) and missed (False Negatives).
In the example confusion matrix above, TPR = 600 / (600 + 300) = 0.67. The model successfully detected 67% of all spam emails.
FPR (False Positive rate) shows the share of objects falsely assigned a positive class out of all objects of the negative class. For example, the proportion of legitimate emails falsely labeled as spam.
You can calculate the FPR by dividing the number of False Positives by the total number of objects of the negative class in the dataset.
You can think of the FPR as a "false alarm rate."
In our example, FPR = 100 / (100 + 9000) = 0.01. The model falsely flagged 1% of legitimate emails as spam.
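As a minimal sketch, both rates can be computed directly from the four cells of the confusion matrix. The counts are the ones from the spam example above; the function name `rates_from_confusion` is only illustrative.

```python
def rates_from_confusion(tp, fn, fp, tn):
    """Return (TPR, FPR) computed from confusion matrix counts."""
    tpr = tp / (tp + fn)  # share of actual positives that were detected
    fpr = fp / (fp + tn)  # share of actual negatives that were falsely flagged
    return tpr, fpr


# Counts from the spam example: 600 TP, 300 FN, 100 FP, 9000 TN
tpr, fpr = rates_from_confusion(tp=600, fn=300, fp=100, tn=9000)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.67, FPR = 0.01
```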
To create the ROC curve, you need to plot the TPR values against the FPR values at different decision thresholds.
You might ask, what do "different" TPR and FPR values mean? Did we not just calculate them once and for all?
In fact, we calculated the values for a given confusion matrix at a given decision threshold. But for a probabilistic classification model, these TPR and FPR values are not set in stone.
You can vary the decision threshold that defines how to convert the model predictions into labels. This, in turn, can change the number of errors the model makes.
A probabilistic classification model returns a number from 0 to 1 for each object. For example, for each email, it predicts how likely this email is spam. For a given email, it can be 0.1, 0.55, 0.99, or any other number.
You then have to decide at which probability you convert this prediction to a label. For instance, you can label all emails with a predicted probability of over 0.5 as spam. Or, you can only apply this decision when the score is 0.8 or higher.
This choice is what sets the classification threshold.
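To make this concrete, here is a small sketch of turning predicted probabilities into labels at two different thresholds. The helper `to_labels` is hypothetical, and treating a score exactly equal to the threshold as positive is an assumption of this example, not a fixed rule.

```python
def to_labels(probabilities, threshold):
    """Label an item as positive (1) when its score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


spam_probabilities = [0.1, 0.55, 0.99]  # example scores from the text

labels_at_05 = to_labels(spam_probabilities, threshold=0.5)
labels_at_08 = to_labels(spam_probabilities, threshold=0.8)
print(labels_at_05)  # [0, 1, 1]
print(labels_at_08)  # [0, 0, 1]
```

The email scored 0.55 is labeled as spam at the 0.5 threshold but not at 0.8, which is how the same model can produce different confusion matrices.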
To better understand the impact of the decision threshold, explore the Classification Threshold chapter in the guide.
As you change the threshold, you will usually get new combinations of errors of different types (and new confusion matrices)!
When you set the threshold higher, you make the model "more conservative." It assigns the True label when it is "more confident." But as a consequence, you typically lower recall: you detect fewer examples of the target class overall.
When you set the threshold lower, you make the model "less strict." It assigns the True label more often, even when "less confident." Consequently, you increase recall: you will detect more examples of the target class. However, this may also lead to lower precision, as the model may make more False Positive predictions.
TPR and FPR change in the same direction. The higher the recall (TPR), the higher the rate of false positive errors (FPR). The lower the recall, the fewer false alarms the model gives.
In the example above, the recall (TPR) decreases as we set the decision threshold higher:
- 0.5 threshold: 800/(800+100)=0.89
- 0.8 threshold: 600/(600+300)=0.67
- 0.95 threshold: 200/(200+700)=0.22
The FPR also goes down:
- 0.5 threshold: 500/(500+8600)=0.05
- 0.8 threshold: 100/(100+9000)=0.01
- 0.95 threshold: 10/(10+9090)=0.001
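The same arithmetic can be sketched in a few lines of Python. The counts per threshold are taken from the example above; the dictionary layout is just one convenient way to hold them.

```python
# Confusion matrix counts per threshold, taken from the example above:
# threshold -> (TP, FN, FP, TN)
counts_by_threshold = {
    0.5: (800, 100, 500, 8600),
    0.8: (600, 300, 100, 9000),
    0.95: (200, 700, 10, 9090),
}

rates = {}
for threshold, (tp, fn, fp, tn) in counts_by_threshold.items():
    # Store (TPR, FPR) for each threshold
    rates[threshold] = (tp / (tp + fn), fp / (fp + tn))

for threshold, (tpr, fpr) in rates.items():
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.3f}")
```

Each (FPR, TPR) pair computed this way becomes one point on the ROC curve.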
Now, let’s get back to the curve!
The ROC curve illustrates this trade-off between the TPR and FPR we just explored. Unless your model is near-perfect, you have to balance the two. As you try to increase the TPR (i.e., correctly identify more positive cases), the FPR may also increase (i.e., you get more false alarms).
For example, the more spam you want to detect, the more legitimate emails you falsely flag as suspicious.
The ROC curve is a visual representation of this choice. Each point on the curve corresponds to a combination of TPR and FPR values at a specific decision threshold.
To create the curve, you should plot the FPR values as the x-axis and the TPR values as the y-axis.
If we continue with the example above, here is how it can look.
Since our imaginary model does fairly well, most values are "crowded" to the left.
The left side of the curve corresponds to the more "confident" thresholds: a higher threshold leads to lower recall and fewer false positive errors. The extreme point is when both recall and FPR are 0. In this case, there are no correct detections but also no false ones.
The right side of the curve represents the "less strict" scenarios when the threshold is low. Both recall and the False Positive rate are higher, ultimately reaching 100%. If you put the threshold at 0, the model will always predict the positive class: both recall and the FPR will be 1.
When you increase the threshold, you move left on the curve. If you decrease the threshold, you move to the right.
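Here is a rough sketch of how the points of the curve can be generated by sweeping the threshold from high to low. The labels and scores are made up for illustration, and `roc_points` is a hypothetical helper rather than a library function.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per candidate threshold, from left to right."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Start above the highest score so that nothing is labeled positive: point (0, 0).
    thresholds = [float("inf")] + sorted(set(y_score), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels (1 = spam) and predicted probabilities
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]

points = roc_points(y_true, y_score)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}, TPR={tpr:.2f}")
```

The first point is (0, 0), the strictest threshold, and the last is (1, 1), where every object is labeled positive. Lowering the threshold only ever moves the point up or to the right.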
Now, let’s take a look at the perfect scenario.
If our model is correct in all the predictions, all the time, it means that the TPR is always 1.0, and FPR is 0. It finds all the cases and never gives false alarms.
Here is how the ROC curve would look.
Now, let’s look at the worst-case scenario.
Let’s say our model is random. In other words, it cannot distinguish between the two classes, and its predictions are no better than chance.
A genuinely random model will predict the positive and negative classes with equal probability.
The ROC curve, in this case, will look like a diagonal line connecting points (0,0) and (1,1). For a random classifier, the TPR is roughly equal to the FPR because, at any threshold value, it flags the same share of positive and negative objects. As the classification threshold changes, the TPR goes up or down in the same proportion as the FPR.
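A quick simulation can illustrate this. The sketch below assumes a balanced, made-up dataset and draws scores at random, independently of the true labels; the seed and sample size are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical balanced dataset; scores are drawn independently of the labels.
y_true = [1] * 10_000 + [0] * 10_000
y_score = [random.random() for _ in y_true]


def rates_at(threshold):
    """Return (TPR, FPR) when scores at or above the threshold are labeled positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / 10_000, fp / 10_000


random_rates = {t: rates_at(t) for t in (0.2, 0.5, 0.8)}
for t, (tpr, fpr) in random_rates.items():
    print(f"threshold={t}: TPR={tpr:.3f}, FPR={fpr:.3f}")
```

At every threshold, the two rates should come out close to each other, which places the points near the diagonal.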
Most real-world models will fall somewhere between the two extremes. The better the model can distinguish between positive and negative classes, the closer the curve is to the top left corner of the graph.
A ROC curve is a two-dimensional reflection of classifier performance across different thresholds. It is convenient to get a single metric to summarize it.
This is what the ROC AUC score does.
A ROC AUC score is a single metric to summarize the performance of a classifier across different thresholds. To compute the score, you must measure the area under the ROC curve.
There are different methods to calculate the ROC AUC score, but a common one is the trapezoidal rule. It approximates the area under the ROC curve by splitting it into vertical strips between consecutive FPR values. Each strip is a trapezoid whose parallel sides are the TPR values at its two ends. Then, you compute the area by summing the areas of the trapezoids.
You can compute ROC AUC in Python using sklearn.
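For instance, a minimal sketch with scikit-learn's `roc_curve` and `roc_auc_score` functions could look like this. The labels and scores are hypothetical; in practice, you would pass your own true labels and the predicted probabilities of the positive class.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels (1 = positive class) and predicted probabilities
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]

# roc_curve returns the FPR and TPR values plus the thresholds they correspond to
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# roc_auc_score computes the area under that curve directly from labels and scores
auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC = {auc:.2f}")  # ROC AUC = 0.75
```

Note that both functions expect scores or probabilities, not the final 0/1 labels: passing hard labels collapses the curve to a single threshold.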
If we return to our extreme "perfect" and "random" examples, computing the ROC AUC score is easy. In the perfect scenario, we measure the square area: ROC AUC is 1. In the random scenario, it is precisely half: ROC AUC is 0.5.
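The trapezoidal rule itself fits in a few lines. The sketch below applies it to the two extreme curves, each given as a list of (FPR, TPR) points; `trapezoid_auc` is an illustrative name, not a library function.

```python
def trapezoid_auc(points):
    """Approximate the area under a curve given as (FPR, TPR) points."""
    points = sorted(points)  # order by FPR, then by TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Each trapezoid: width along the FPR axis times the average TPR height
        area += (x1 - x0) * (y0 + y1) / 2
    return area


perfect_curve = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
random_curve = [(0.0, 0.0), (1.0, 1.0)]

print(trapezoid_auc(perfect_curve))  # 1.0
print(trapezoid_auc(random_curve))   # 0.5
```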
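The ROC AUC score can range from 0 to 1. A score of 0.5 indicates random guessing, and a score of 1 indicates perfect performance. A score below 0.5 means the model ranks instances worse than chance, which often points to inverted predictions or labels.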
A score slightly above 0.5 shows that a model has at least "some" (albeit small) predictive power. This is generally inadequate for any real applications.
As a rule of thumb, a ROC AUC score above 0.8 is considered good, while a score above 0.9 is considered great.
However, the usefulness of the model depends on the specific problem and use case. There is no standard. You should interpret the ROC AUC score in context, together with other classification quality metrics, such as accuracy, precision, or recall.
The intuition behind ROC AUC is that it measures how well a binary classifier can distinguish or separate between the positive and negative classes.
It reflects the probability that the model will correctly rank a randomly chosen positive instance higher than a random negative one.
For example, this is how the model predictions might look, arranged by the predicted output scores.
ROC AUC reflects the likelihood that a random positive (red) instance will be located to the right of a random negative (gray) instance.
It shows how well a model can produce good relative scores and generally assign higher probabilities to positive instances over negative ones.
In the above picture, the classifier is not perfect but "directionally correct." It ranks most negative instances lower than positive ones.
The ideal situation is to have all positive instances ranked higher than all negative instances, resulting in an AUC of 1.0.
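This ranking view can be checked directly by comparing every positive instance with every negative one. The sketch below uses made-up scores; counting ties as half a correct pair is a common convention assumed here.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs where the positive gets the higher score.

    Ties count as half a correctly ranked pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted scores for positive and negative instances
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2, 0.1]

auc = pairwise_auc(pos_scores, neg_scores)
print(f"{auc:.3f}")  # 0.917
```

Here, 11 of the 12 possible pairs are ranked correctly. The only miss is the positive instance scored 0.4, which falls below the negative instance scored 0.7.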
It’s worth noting that even a perfect ROC AUC does not mean the predictions are well-calibrated. A well-calibrated classifier produces predicted probabilities that reflect the actual probabilities of the events. Say, if it predicts that an event has a 70% chance of occurring, it should be correct about 70% of the time. ROC AUC is not a calibration measure.
ROC AUC score, instead, shows how well a model can produce relative scores that help discriminate between positive or negative instances.
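One way to see the difference is to distort the scores without changing their order. In the sketch below, which uses hypothetical scores and a pairwise helper written for this example, squaring every probability changes the predicted values but leaves the ROC AUC untouched.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2, 0.1]

# Squaring is a monotonic transform on [0, 1]: it changes the probability
# values (and so the calibration) but keeps the ordering of the scores.
pos_squared = [s ** 2 for s in pos_scores]
neg_squared = [s ** 2 for s in neg_scores]

auc_original = pairwise_auc(pos_scores, neg_scores)
auc_squared = pairwise_auc(pos_squared, neg_squared)
print(auc_original == auc_squared)  # True
```

A prediction of 0.9 becomes roughly 0.81 after squaring, so the probabilities no longer mean the same thing, yet the metric cannot tell the two sets of scores apart.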
Let’s sum up the important properties of the metric.
Here are some advantages of the ROC AUC score.
The metric also has a few downsides. As usual, a lot depends on the context!
Want to see an example of using ROC AUC? We prepared a tutorial on the employee churn prediction problem, "What is your model hiding?". You will train two classification models with similar ROC AUC and explore how to compare them.
Considering all the above, ROC AUC is useful, but as usual, not a perfect metric.
However, there are limitations:
You can use ROC AUC during production model monitoring as long as you have the true labels to compute it.
However, a high ROC AUC score does not communicate all relevant aspects of the model quality. The score evaluates the degree of separability and does not consider the asymmetric costs of false positives and negatives. It captures, in one number, the quality of the model across all possible thresholds.
In many real-world scenarios, this overall performance is not relevant: you need to consider the costs of error and define a specific threshold to make automated decisions. Therefore, the ROC AUC score should be used with other metrics, such as precision and recall. You might also want to monitor precision and recall for specific important segments in your data (such as users in specific locations, premium users, etc.) to capture differences in performance.
However, having ROC AUC as an additional metric might still be informative. For example, in cases where the shifting balance of classes might negatively impact recall, tracking ROC AUC might communicate whether the model itself remains reasonable.
To quickly calculate and visualize the ROC curve and ROC AUC score, as well as other metrics and plots to evaluate the quality of a classification model, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.
You will need to prepare your dataset that includes predicted values for each class and true labels and pass it to the tool. You will instantly get an interactive report that includes ROC AUC, accuracy, precision, recall, F1-score metrics as well as other visualizations. You can also integrate these model quality checks into your production pipelines.