The Case Against Precision as a Model Selection Criterion

November 21, 2018 (Last Modified: December 01, 2018)

Recently, I have introduced sensitivity and specificity as performance measures for model selection. Besides these measures, there is also the notion of recall and precision. Precision and recall originate from information retrieval but are also used in machine learning settings. However, the use of precision and recall can be problematic in some situations. In this post, I discuss the shortcomings of recall and precision and show why sensitivity and specificity are generally more useful.

Definitions

For a binary classification problem with classes 0 and 1, the resulting confusion matrix has the following structure:

Prediction/Reference	1	0
1	TP	FP
0	FN	TN

where TP indicates the number of true positives (model correctly predicts positive class), FP indicates the number of false positives (model incorrectly predicts positive class), FN indicates the number of false negatives (model incorrectly predicts negative class), and TN indicates the number of true negatives (model correctly predicts negative class). The definitions of sensitivity (recall, true positive rate, TPR), precision (positive predictive value, PPV), and specificity (true negative rate, TNR) are as follows:

\[\begin{align*} \text{sensitivity} &= \text{recall} = TPR = \frac{TP}{TP + FN} \\ \text{precision} &= PPV = \frac{TP}{TP + FP} \\ \text{specificity} &= TNR = 1 - FPR = 1 - \frac{FP}{FP + TN} \\ \end{align*}\]

Sensitivity and precision are related in that both use TP in the numerator. While sensitivity identifies the rate at which observations from the positive class are correctly predicted, precision indicates the rate at which positive predictions are correct. Specificity, on the other hand, is based on the number of false positives and indicates the rate at which observations from the negative class are correctly predicted.
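The three definitions can be sketched in a few lines of R. The helper name `confusion_rates` and the counts are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: derive the three rates from the four cells
# of a binary confusion matrix.
confusion_rates <- function(tp, fp, fn, tn) {
  c(
    sensitivity = tp / (tp + fn),  # recall, TPR
    precision   = tp / (tp + fp),  # PPV
    specificity = tn / (tn + fp)   # TNR = 1 - FPR
  )
}

# Made-up counts, chosen only to show the calculation
rates <- confusion_rates(tp = 40, fp = 10, fn = 20, tn = 30)
print(round(rates, 3))
```

Note that TN only enters the specificity term, which is the point the rest of this post builds on.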

The advantage of sensitivity and specificity

Evaluating a model based on both sensitivity and specificity is appropriate for most data sets because, together, these measures consider all entries in the confusion matrix. While sensitivity deals with true positives and false negatives, specificity deals with false positives and true negatives. This means that the combination of sensitivity and specificity is a holistic measure when both true positives and true negatives should be considered.

Sensitivity and specificity can be summarized by a single quantity, the balanced accuracy, which is defined as the mean of both measures:

\[\text{balanced accuracy} = \frac{\text{sensitivity + specificity}}{2} \]

The balanced accuracy is in the range \([0,1]\), where values of 0 and 1 indicate the worst-possible and the best-possible classifier, respectively.
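As a minimal sketch, balanced accuracy can be computed directly from the confusion matrix cells. The function name `balanced_accuracy` and the counts below are assumptions for illustration: a hypothetical classifier that labels every observation as positive on a data set with 90 positives and 10 negatives.

```r
# Hypothetical helper: mean of sensitivity and specificity.
balanced_accuracy <- function(tp, fp, fn, tn) {
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  (sensitivity + specificity) / 2
}

# Assumed example: everything is predicted as positive, so
# sensitivity is 1 and specificity is 0.
ba_all_positive <- balanced_accuracy(tp = 90, fp = 10, fn = 0, tn = 0)
print(ba_all_positive)
```

Although the plain accuracy of this classifier is 90%, its balanced accuracy is only 0.5 because the negative class is never identified.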

The disadvantage of recall and precision

Evaluating a model using recall and precision does not use all cells of the confusion matrix. Recall deals with true positives and false negatives, and precision deals with true positives and false positives. Thus, with this pair of performance measures, true negatives are never taken into account. Precision and recall should therefore only be used in situations where the correct identification of the negative class does not play a role. This is why these measures originate from information retrieval, where precision can be defined as

\[\text{precision} = {\frac {|\{{\text{relevant documents}}\}\cap \{{\text{retrieved documents}}\}|}{|\{{\text{retrieved documents}}\}|}}\,.\]

Here, the rate at which irrelevant documents are correctly discarded (the true negative rate) is of no consequence.

Precision and recall are often summarized as a single quantity, the F1-score, which is the harmonic mean of both measures:

\[F1 = 2 \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}} \]

F1 is in the range \([0,1]\) and will be 1 for a classifier that maximizes both precision and recall. Since it is based on the harmonic mean, the F1-score is very sensitive to disparate values of precision and recall. Assume a classifier has a recall (sensitivity) of 90% and a precision of 30%. Then the arithmetic mean would be \(\frac{0.9 + 0.3}{2} = 0.6\), but the harmonic mean (F1-score) would be \(2 \frac{0.9 \cdot 0.3}{0.9 + 0.3} = 0.45\).
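The comparison of the two means can be checked with a short R sketch. The function name `f1_score` is an assumption for illustration.

```r
# Hypothetical helper: harmonic mean of precision and recall.
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

recall <- 0.9
precision <- 0.3

arithmetic_mean <- mean(c(recall, precision))
f1 <- f1_score(precision, recall)

# The harmonic mean is pulled towards the smaller of the two values.
print(c(arithmetic_mean = arithmetic_mean, f1 = f1))
```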

Examples

Here, I provide two examples. The first example investigates what can go wrong when precision is used as a performance metric. The second example shows a setting in which the use of precision is adequate.

What can go wrong when using precision?

Precision is a particularly bad measure when there are few observations that belong to the negative class. Let us assume a clinical data set in which \(90\%\) of persons are diseased (positive class) and only \(10\%\) are healthy (negative class). Let us assume we have developed two tests for classifying whether a patient is diseased or healthy. Both tests have an accuracy of \(80\%\) but make different types of errors.

# to use waffle, you need 
#   o FontAwesome
#   o register the fonts using extrafont::font_import()
library(waffle)
ref.colors <- c("#c14141", "#1853b2")
false.colors <- c("#9b3636", "#0e3168")
true.colors <- c("#f75959", "#2474f2")
iron(
    waffle(c("Diseased" = 90, "Healthy" = 10), rows = 5, use_glyph = "child", 
        glyph_size = 5, title = "Reference", colors = ref.colors),
    waffle(c("Diseased (TP)" = 80, "Healthy (FN)" = 10, "Diseased (FP)" = 10), 
        rows = 5, use_glyph = "child", 
        glyph_size = 5, title = "Clinical Test 1", colors = c(true.colors[1], false.colors[2], false.colors[1])),
    waffle(c("Diseased (TP)" = 70, "Healthy (FN)" = 20, "Healthy (TN)" = 10), 
        rows = 5, use_glyph = "child", 
        glyph_size = 5, title = "Clinical Test 2", colors = c(true.colors[1], false.colors[2], true.colors[2]))
)

Confusion matrix for the first test

Prediction/Reference	Diseased	Healthy
Diseased	TP = 80	FP = 10
Healthy	FN = 10	TN = 0

Confusion matrix for the second test

Prediction/Reference	Diseased	Healthy
Diseased	TP = 70	FP = 0
Healthy	FN = 20	TN = 10

Comparison of the two tests

Let us compare the performance of the two tests:

Measure	Test 1	Test 2
Sensitivity (Recall)	88.9%	77.8%
Specificity	0%	100%
Precision	88.9%	100%

Considering sensitivity and specificity, we would not select the first test because its balanced accuracy is merely \(\frac{0 + 0.889}{2} \approx 44.4\%\), while that of the second test is \(\frac{0.778 + 1}{2} = 88.9\%\).

Using precision and recall, however, the first test would have an F1-score of \(2 \cdot \frac{0.889 \cdot 0.889}{0.889 + 0.889} = 0.889\), while the second test has a lower score of \(2 \cdot \frac{0.778 \cdot 1}{0.778 + 1} \approx 0.875\). Thus, we would find the first test to be superior to the second test although its specificity is 0%. When using this test, all of the healthy patients would be classified as diseased. This would be a big problem because all of these patients would undergo severe psychological stress and expensive treatment due to the misdiagnosis. If we had used specificity instead, we would have selected the second test, which does not produce any false positives at a competitive sensitivity.
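The comparison can be recomputed from the two confusion matrices with a short R sketch. The helper name `metrics` is an assumption for illustration; the counts are those of the two clinical tests, and the results should agree with the figures in the text up to rounding.

```r
# Hypothetical helper: all measures discussed in this post,
# computed from the cells of a confusion matrix.
metrics <- function(tp, fp, fn, tn) {
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  precision <- tp / (tp + fp)
  c(
    sensitivity = sensitivity,
    specificity = specificity,
    precision = precision,
    balanced_accuracy = (sensitivity + specificity) / 2,
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
  )
}

test1 <- metrics(tp = 80, fp = 10, fn = 10, tn = 0)
test2 <- metrics(tp = 70, fp = 0, fn = 20, tn = 10)

# One row per clinical test
print(round(rbind(test1, test2), 3))
```

The F1-score ranks the first test higher, while the balanced accuracy ranks the second test higher.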

Use of precision when true negatives do not matter

Let us consider an example from information retrieval to illustrate when precision is a useful criterion. Assume that we want to compare two algorithms for document retrieval that both have an accuracy of 80%.

library(waffle)
# reuses ref.colors, false.colors, and true.colors from the previous example
iron(
    waffle(c("Relevant" = 30, "Irrelevant" = 70), rows = 5, use_glyph = "file", 
        glyph_size = 5, title = "Reference", colors = ref.colors),
    waffle(c("Relevant (TP)" = 25, "Irrelevant (FN)" = 5, "Relevant (FP)" = 15, "Irrelevant (TN)" = 55), 
        rows = 5, use_glyph = "file", 
        glyph_size = 5, title = "Retrieval Algorithm 1", colors = c(true.colors[1], false.colors[2], false.colors[1], true.colors[2])),
    waffle(c("Relevant (TP)" = 20, "Irrelevant (FN)" = 10, "Relevant (FP)" = 10, "Irrelevant (TN)" = 60), 
        rows = 5, use_glyph = "file", 
        glyph_size = 5, title = "Retrieval Algorithm 2", colors = c(true.colors[1], false.colors[2], false.colors[1], true.colors[2]))
)

Confusion matrix for the first algorithm

Prediction/Reference	Relevant	Irrelevant
Relevant	TP = 25	FP = 15
Irrelevant	FN = 5	TN = 55

Confusion matrix for the second algorithm

Prediction/Reference	Relevant	Irrelevant
Relevant	TP = 20	FP = 10
Irrelevant	FN = 10	TN = 60

Comparison of the two algorithms

Let us calculate the performance of the two algorithms from the confusion matrix:

Measure	Algorithm 1	Algorithm 2
Sensitivity (Recall)	83.3%	66.7%
Specificity	78.6%	85.7%
Precision	62.5%	66.7%
Balanced accuracy	80.95%	76.2%
F1-score	71.4%	66.7%

In this example, both balanced accuracy and the F1-score would lead to preferring the first over the second algorithm. Note that the reported balanced accuracy is decidedly larger than the F1-score. This is because specificity is high for both algorithms due to the large number of discarded observations from the negative class. Since the F1-score does not consider the rate of true negatives, precision and recall are more appropriate than sensitivity and specificity for this task.
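That the F1-score ignores true negatives can be illustrated with a small R sketch. The helper name and the enlarged count of 5500 correctly discarded documents are assumptions for illustration; the remaining counts are those of the first retrieval algorithm.

```r
# Hypothetical helper: F1-score and balanced accuracy from the
# cells of a confusion matrix.
f1_and_balanced_accuracy <- function(tp, fp, fn, tn) {
  recall <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  specificity <- tn / (tn + fp)
  c(
    f1 = 2 * precision * recall / (precision + recall),
    balanced_accuracy = (recall + specificity) / 2
  )
}

# First retrieval algorithm as given in the confusion matrix
original <- f1_and_balanced_accuracy(tp = 25, fp = 15, fn = 5, tn = 55)
# Assumed variant: same retrieved documents, but a much larger
# collection of correctly discarded irrelevant documents
enlarged <- f1_and_balanced_accuracy(tp = 25, fp = 15, fn = 5, tn = 5500)

print(round(rbind(original, enlarged), 3))
```

The F1-score is identical in both cases, whereas the balanced accuracy increases with the number of irrelevant documents in the collection even though the retrieved result list has not changed.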

Summary

In this post, we have seen that performance measures should be carefully selected. While sensitivity and specificity generally perform well, precision and recall should only be used in circumstances where the true negative rate does not play a role.

Comments

Erbo
02 Dec 18 02:15 UTC

Confusion matrix layouts are confusing. ;) Maybe It’s a European thing? I’m used to the top row being TP, FP and columns being “predicted.” Anyways in the last section, I’m getting different sensitivity, specificity, etc values for the second algorithm. I think the counts are off in the confusion matrix. Relevant and Irrelevant totals don’t match the first algorithm totals.

Erbo
02 Dec 18 02:26 UTC

Also in the last section, you conclude with the balanced accuracy choosing the second algorithm after stating the second algorithm had the lower value for that metric. You want the higher accuracy so I assume something is mixed up there too. Enjoyed the post.

Matthias Döring
02 Dec 18 09:52 UTC

Thanks for your comments Erbo. You’re right about the ordering in the table - the positive class should definitely come first, I just changed this. Regarding the ordering of predicted/reference I’m not sure, I think I have seen both versions, so I will keep it as is. I hope it’s not too confusing ;-) I agree, however, that it would make things much easier if there was really a standard for the format of confusion matrices.

In the second example, the rates were indeed off. It should be fixed now :-) Thanks for having a keen eye!

Fluff
26 Nov 19 15:47 UTC

This is an excellent article! There are very few posts out there comparing specificity vs precision approaches from a “Why” perspective and in such a well thought out way. Thank you!

There is a mistake under “What can go wrong when using precision” You say “Precision is a particularly bad measure when there are few observations that belong to the positive class. Let us assume a clinical data set in which 90% of persons are diseased (positive class) and only 10% are healthy (negative class)”

(You’re firstly saying that precision is a bad measure when the positive class is a minority. You then go on to give an example where 90% of population are in positive class.)