# Performance Measures for Feature Selection

In a recent post, I discussed performance measures for model selection. This time, I write about a related topic: performance measures that are suitable for comparing models during feature selection. Since feature selection is concerned with reducing the number of independent variables, suitable performance measures evaluate the trade-off between the number of features, $$p$$, and the fit of the model.

## Performance measures for regression

Mean squared error (MSE) and $$R^2$$ are unsuited for comparing models during feature selection because they never penalize additional features: according to these measures, a model whose set of features is a superset of the set of features from another model never performs worse. By using the adjusted $$R^2$$ or Mallows's Cp statistic, it is possible to consider both performance and the number of features.

Given estimates of the outcome $$\hat{Y}$$ and observed outcomes $$Y$$, the coefficient of determination can be defined as the square of Pearson’s correlation coefficient $$\rho$$:

$R^2 = \rho_{\hat{Y}, Y}^2 = \left(\frac{\text{Cov}(\hat{Y}, Y)}{\sigma_{\hat{Y}} \sigma_Y} \right)^2\,.$
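As a quick sanity check of this identity (using simulated data, not taken from the original post), the squared Pearson correlation between fitted and observed values matches the $$R^2$$ reported by lm:

```r
# for an OLS model with an intercept, R^2 equals the squared Pearson
# correlation between fitted and observed outcomes
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
cor(fitted(fit), y)^2   # squared correlation
summary(fit)$r.squared  # identical value
```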

For models with an intercept, $$R^2$$ is in the range $$[0,1]$$. The adjusted R squared corrects $$R^2$$ according to the degrees of freedom of the model, $$n - p - 1$$:

$R^2_{\rm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$

The intuition behind $$R^2_{\rm{adj}}$$ is the following:

* $$R^2_{\rm{adj}}$$ increases if the numerator decreases, that is, if $$R^2$$ is large
* $$R^2_{\rm{adj}}$$ increases if the denominator increases, that is, if $$p$$ is small

When additional features are added to a model, $$R^2_{\rm{adj}}$$ increases only if the added predictors increase $$R^2$$ sufficiently.
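A small numeric illustration (with hypothetical values for $$R^2$$, $$n$$, and $$p$$) makes this penalty visible:

```r
# adjusted R^2 as a function of p, for fixed R^2 = 0.8 and n = 50
adj.r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
sapply(c(1, 5, 10, 20), function(p) adj.r2(r2 = 0.8, n = 50, p = p))
# the values shrink as p grows, although R^2 stays constant
```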

### Adjusted R squared in R

The adjusted R squared can be directly obtained from the summary method of an lm object:

```r
set.seed(1501)
N <- 50
y <- rnorm(N)                 # observed outcomes
set.seed(1001)
y.hat <- y + runif(N, -1, 1)  # simulated predictions: outcomes plus uniform noise
df.low <- data.frame(Y = y, Y_Hat = y.hat)
model <- lm(Y ~ Y_Hat, data = df.low)
summary(model)$adj.r.squared
```
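To see the penalty in action, we can add a predictor that is pure noise (constructed here for illustration, not part of the original example): $$R^2$$ cannot decrease, but the adjusted value typically does.

```r
# augment the model with a pure-noise feature
set.seed(2)
df.low$Noise <- rnorm(N)
model.noise <- lm(Y ~ Y_Hat + Noise, data = df.low)
c(summary(model)$r.squared, summary(model.noise)$r.squared)          # never decreases
c(summary(model)$adj.r.squared, summary(model.noise)$adj.r.squared)  # typically decreases
```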
### Mallows's Cp in R

Mallows's Cp compares a model that uses a subset of $$p$$ coefficients to the full model containing all features. It is defined as

$C_p = \frac{RSS_p}{\hat{\sigma}^2} - N + 2p$

where $$RSS_p$$ is the residual sum of squares of the subset model, $$\hat{\sigma}^2$$ is the estimate of the error variance obtained from the full model, and $$N$$ is the number of observations. A subset model whose Cp value is small and close to $$p$$ is considered adequate, while a Cp value much larger than $$p$$ indicates a biased (underfitted) model. Since the Cp statistic is not available from base R, we define it ourselves:

```r
cp <- function(model.subset, model.full) {
  N <- length(residuals(model.full))            # number of observations
  p <- length(coefficients(model.subset))       # coefficients in the subset model
  rss.subset <- sum(residuals(model.subset)^2)  # RSS of the subset model
  # error-variance estimate from the full model
  sigma2.full <- sum(residuals(model.full)^2) /
                 (N - length(coefficients(model.full)))
  c <- (rss.subset / sigma2.full) - N + (2 * p)
  return(c)
}
```

To demonstrate the use of this function, we will use the airquality data set, for which we create three models using subsets of the features:

```r
data(airquality)
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
m1 <- lm(Temp ~ Ozone, data = ozone)
m2 <- lm(Temp ~ Ozone + Wind, data = ozone)
m.full <- lm(Temp ~ Ozone + Wind + Solar.R, data = ozone)
```

We can now determine the Cp statistic for the three models:

```r
c1 <- cp(m1, m.full)
c2 <- cp(m2, m.full)
c3 <- cp(m.full, m.full)
print(paste0("Cps for three models: ",
             paste(round(c(c1, c2, c3), 3), collapse = ", ")))
```

By construction, the Cp value of the full model equals its number of coefficients; among the subset models, we would select the one whose Cp value is closest to its own $$p$$.

## Performance measures for classification

The Akaike information criterion (AIC) can be used for both regression and classification. It is defined as

$AIC = 2p - 2 \cdot \ln(\hat{L})$

where $$\hat{L}$$ is the maximum of the likelihood function. A desirable model minimizes the AIC because this is the model that has the best fit (high $$\hat{L}$$) with the fewest possible number of features (low $$p$$).

### Calculating the AIC for generalized linear models

For generalized linear models, the AIC is directly available from the summary function for glm objects:

```r
m1 <- glm(Temp ~ Ozone, data = ozone)
m2 <- glm(Temp ~ Ozone + Wind, data = ozone)
m.full <- glm(Temp ~ Ozone + Wind + Solar.R, data = ozone)
aics <- c(summary(m1)$aic, summary(m2)$aic, summary(m.full)$aic)
print(paste0("AICs are: ", paste(aics, collapse = ", ")))
```

In this case, we would select the model with the lowest AIC, as it offers the best trade-off between goodness of fit and the number of features.
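The models above are linear (gaussian) models; for an actual classification model, the AIC is obtained in the same way. A minimal sketch with logistic regression, where the binary outcome HighTemp is constructed here purely for illustration:

```r
# AIC for classification models: logistic regression on a binary outcome
# (HighTemp is a hypothetical label derived from Temp for this sketch)
ozone$HighTemp <- as.numeric(ozone$Temp > median(ozone$Temp))
g1 <- glm(HighTemp ~ Ozone, data = ozone, family = binomial)
g2 <- glm(HighTemp ~ Ozone + Wind, data = ozone, family = binomial)
c(summary(g1)$aic, summary(g2)$aic)  # the lower AIC indicates the preferable model
```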

### Calculating the AIC more generally

More generally, the AIC can be calculated using the AIC function for all models for which a log-likelihood is defined:

```r
aics <- c(AIC(m1), AIC(m2), AIC(m.full))
print(paste0("AICs are: ", paste(aics, collapse = ", ")))
```

The resulting values are identical to those obtained from the summary function.

## Alternatives to these performance measures

Rather than computing performance measures that take model complexity into account, you could also evaluate model performance on held-out data (e.g. using cross-validation) to prevent overfitting, as sketched below.
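As a sketch of this alternative, the following minimal k-fold cross-validation helper (cv.mse is a hypothetical function written for this post, not a library routine) compares feature subsets by their held-out mean squared error:

```r
# minimal k-fold cross-validation: average held-out MSE across folds
cv.mse <- function(formula, data, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold assignment
  fold.errors <- sapply(1:k, function(i) {
    fit <- lm(formula, data = data[folds != i, ])     # train on k-1 folds
    pred <- predict(fit, newdata = data[folds == i, ])
    observed <- data[folds == i, all.vars(formula)[1]]
    mean((observed - pred)^2)                         # MSE on the held-out fold
  })
  mean(fold.errors)
}
set.seed(123)
c(cv.mse(Temp ~ Ozone, ozone), cv.mse(Temp ~ Ozone + Wind, ozone))
```

The feature subset with the lowest cross-validated MSE would be preferred, without any explicit complexity penalty.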