Performance measures for model validation

Performance measures for supervised learning

Besides interpretability, predictive performance is the most important property of machine learning models. Here, I provide an overview of available performance measures and discuss under which circumstances they are appropriate.

Performance measures for regression

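For regression, the most popular performance measures are the coefficient of determination (\(R^2\)) and the root mean squared error (RMSE). \(R^2\) has the advantage that it typically lies in the interval \([0,1]\), which makes it easier to interpret than the RMSE, whose value is on the scale of the outcome. Note that \(R^2\) can become negative when a model predicts worse than the mean of the outcome, for example on held-out data.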
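As a minimal sketch of the two measures, the following plain-Python functions compute the RMSE and \(R^2\) from observed and predicted values. The function names and the toy data are made up for illustration, and \(R^2\) is computed as one minus the ratio of the residual to the total sum of squares.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error; its value is on the scale of the outcome."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy data (hypothetical values, for illustration only)
y_true = [3.0, 5.0, 7.0, 9.0]
y_good = [2.5, 5.5, 7.0, 8.0]   # close to the observations
y_bad = [9.0, 7.0, 5.0, 3.0]    # worse than predicting the mean

print(rmse(y_true, y_good), r_squared(y_true, y_good))
print(rmse(y_true, y_bad), r_squared(y_true, y_bad))
```

The second set of predictions yields a negative \(R^2\), which illustrates why the measure is only typically, and not always, within \([0,1]\).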

Performance measures for classification

The performance of models for binary classification is evaluated on the basis of confusion matrices, which tabulate true positives, false positives, true negatives, and false negatives. From these counts, sensitivity and specificity are derived; their mean is the balanced accuracy. When the positive class is rare or true negatives are of little interest, it can be worthwhile to consider recall (which is identical to sensitivity) and precision, whose harmonic mean is the F1 score, rather than sensitivity and specificity.
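To make these quantities concrete, here is a small plain-Python sketch that counts the four cells of the confusion matrix and derives the measures from them. The helper names and the example labels are hypothetical, and the code assumes that no denominator is zero (that is, both classes occur and at least one positive prediction is made).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def binary_metrics(tp, fp, tn, fn):
    """Derive common measures; assumes all denominators are non-zero."""
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Toy labels (hypothetical): 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = binary_metrics(*counts)
print(counts, metrics)
```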

For scoring classifiers, the area under the receiver operating characteristic curve (AUC) can be used to summarize the sensitivity-specificity tradeoff across all classification thresholds.
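One way to compute the AUC without drawing the curve relies on its probabilistic interpretation: it equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below, in plain Python with a hypothetical function name and toy scores, implements this pairwise comparison. It is quadratic in the number of observations, so it is meant for illustration rather than for large data sets.

```python
def auc(y_true, scores, positive=1):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy scores (hypothetical): one positive is ranked below one negative
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(auc(y_true, scores))
```

In this example, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.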

Performance measures for feature selection

When comparing models with different numbers of features, model complexity should be taken into account through measures such as the adjusted \(R^2\) or the Akaike information criterion (AIC). Alternatively, to curb overfitting, model performance can be determined on an independent test set (e.g. via cross-validation).
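The following plain-Python sketch shows how both complexity-aware measures penalize additional features. It assumes the standard definitions: the adjusted \(R^2\) for \(n\) observations and \(p\) features (excluding the intercept), and the AIC as \(2k - 2\ln L\) for \(k\) estimated parameters and maximized likelihood \(L\). The function names and all numbers are hypothetical.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p features (excluding intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * k - 2 * log_likelihood

# Two hypothetical models with the same raw R^2 on n = 50 observations
adj_small = adjusted_r_squared(0.80, n=50, p=3)
adj_large = adjusted_r_squared(0.80, n=50, p=10)
print(adj_small, adj_large)   # the model with fewer features scores higher

# Hypothetical log-likelihoods: the larger model fits slightly better,
# but not enough to offset the penalty for its extra parameters
aic_small = aic(log_likelihood=-120.0, k=4)
aic_large = aic(log_likelihood=-118.5, k=8)
print(aic_small, aic_large)
```

Both criteria prefer the smaller model here: its adjusted \(R^2\) is higher and its AIC is lower.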

Posts about performance measures

The following posts discuss performance measures for supervised learning and how they can be computed using R.

Performance Measures for Multi-Class Problems

For multi-class prediction scenarios, we can use performance measures similar to those for binary classification. Here, I explain how we can obtain the (weighted) accuracy, micro- and macro-averaged F1 scores, and a generalization of the AUC to the multi-class setting.
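As a rough illustration of the two averaging schemes, the plain-Python sketch below computes per-class counts in a one-vs-rest fashion. The macro-averaged F1 is the unweighted mean of the per-class F1 scores, whereas the micro-averaged F1 pools the counts over all classes first. The helper names and labels are hypothetical, and this sketch is not taken from the linked post.

```python
def per_class_counts(y_true, y_pred, labels):
    """One-vs-rest (tp, fp, fn) counts for each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 if the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """F1 computed from counts pooled over all classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)

# Toy labels (hypothetical) for three classes
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

counts = per_class_counts(y_true, y_pred, labels=["a", "b", "c"])
print(macro_f1(counts), micro_f1(counts))
```

When every observation has exactly one label, as in this example, the micro-averaged F1 coincides with the overall accuracy (4 of 6 predictions are correct).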

Performance Measures for Model Selection

One of the main criteria indicating the quality of a machine learning model is its predictive performance. However, suitable performance measures differ depending on the prediction task. This post investigates the quantities most commonly used for selecting regression and classification models.