Variable selection in linear regression models: Choosing the best subset is not always the best choice
Moritz Hanke, Louis Dijkstra, Ronja Foraita, Vanessa Didelez
- Statistics, Probability and Uncertainty
- General Medicine
- Statistics and Probability
Abstract
We consider the question of variable selection in linear regressions, in the sense of identifying the correct direct predictors (those variables that have nonzero coefficients given all candidate predictors). Best subset selection (BSS) is often considered the “gold standard,” with its use being restricted only by its NP‐hard nature. Alternatives such as the least absolute shrinkage and selection operator (Lasso) or the Elastic net (Enet) have become methods of choice in high‐dimensional settings. A recent proposal represents BSS as a mixed‐integer optimization problem so that large problems have become computationally feasible. We present an extensive neutral comparison assessing the ability of BSS to select the correct direct predictors, compared to forward stepwise selection (FSS), Lasso, and Enet. The simulation considers a range of settings that are challenging regarding dimensionality (number of observations and variables), signal‐to‐noise ratios, and correlations between predictors. As a fair measure of performance, we primarily used the best possible F1‐score for each method, and results were confirmed by alternative performance measures and practical criteria for choosing the tuning parameters and subset sizes. Surprisingly, it was only in settings where the signal‐to‐noise ratio was high and the variables were uncorrelated that BSS reliably outperformed the other methods, even in low‐dimensional settings. Furthermore, FSS performed almost identically to BSS. Our results shed new light on the usual presumption of BSS being, in principle, the best choice for selecting the correct direct predictors. Especially for correlated variables, alternatives like Enet are faster and appear to perform better in practical settings.
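To make the "best possible F1‐score" idea concrete, the following is a minimal sketch and not the authors' simulation code. It assumes that "best possible" means the maximum F1‐score over all subset sizes along a method's selection path, uses a plain NumPy implementation of forward stepwise selection, and picks an easy setting (uncorrelated predictors, high signal‐to‐noise ratio). All function names, the sample size, the coefficients, and the noise level are hypothetical choices for illustration.

```python
import numpy as np


def f1_selection(selected, true_support):
    """F1-score of a selected variable set against the true direct predictors."""
    selected, true_support = set(selected), set(true_support)
    tp = len(selected & true_support)  # correctly selected variables
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(true_support)
    return 2 * precision * recall / (precision + recall)


def forward_stepwise_path(X, y, max_size):
    """Greedy forward selection: at each step, add the variable that
    yields the smallest residual sum of squares. Returns the list of
    nested active sets of size 1, 2, ..., max_size."""
    p = X.shape[1]
    active, path = [], []
    for _ in range(max_size):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ beta) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)
        path.append(list(active))
    return path


def best_possible_f1(path, true_support):
    """Oracle measure (assumed reading): best F1 over all subset sizes on the path."""
    return max(f1_selection(s, true_support) for s in path)


# Hypothetical easy setting: independent predictors, strong signal.
rng = np.random.default_rng(0)
n, p = 200, 20
true_support = [0, 1, 2]
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[true_support] = 3.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

path = forward_stepwise_path(X, y, max_size=10)
best_f1 = best_possible_f1(path, true_support)
```

Taking the maximum over the path removes the influence of any particular rule for choosing the subset size or tuning parameter, which is why it can serve as a common yardstick across methods; a practical criterion would instead pick a single point on the path.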