Model assessment and variable selection
Whenever we add a variable in our model, we are not only adding information but also noise that clutter the information and makes the analysis difficult. Simpler model is always better since they contain less noise and they are easy to interpret. In real life, things are not that simple. But relax, there are statistical methods that make model assessment and perform variable selection and gives optimal set of variables for us.
Here, I will discuss how to compare model to know which one is better than other. This model assessment help us to select a model with few variable that can perform as good as a model with large number of variables. Here comes the idea of variable selection. I will try to explain these concepts with an example. For the example I will use mtcars
data set which is available in R by default. Following are two well know procedure of selecting subset of variables,
Best subset method
Best subset procedure selects best regression model by running all possible subset of variables. When number of variable is large, the possible combinations of candidate subset become huge. For example, for 2 variable case (\(X_1\) and \(X_2\)), there are 4 possible subsets – 1 with no predictors, 2 with single predictors and 1 with both predictors. Similarly, if there are 4 variable case (\(X_1, X_2, X_3\) and \(X_4\)), there will be 16 possible subset. In general, fitting all possible subset of large number of predictors becomes computationally intensive.
Once, all the possible model is fitted, they are compared based on some criteria. This model assessment may based on various criteria such as coefficient of determination (\(R^2\)), adjusted \(R^2\), Mallow’s CP, AIC and BIC. In R, leaps package can be used for performing this operation. Lets dig into our mtcars
example.
Fitting a complete model:
full.model <- lm(mpg ~ ., data = mtcars)
smry <- summary(full.model)
full.model <- lm(mpg ~ ., data = mtcars)
smry <- summary(full.model)
smry
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.450644085 -1.604401795 -0.119605107 1.219267814 4.627094192
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.3033741560 18.7178844287 0.65731 0.518124
cyl -0.1114404779 1.0450233625 -0.10664 0.916087
disp 0.0133352399 0.0178575003 0.74676 0.463489
hp -0.0214821190 0.0217685793 -0.98684 0.334955
drat 0.7871109722 1.6353730686 0.48130 0.635278
wt -3.7153039283 1.8944142995 -1.96119 0.063252
qsec 0.8210407497 0.7308447960 1.12341 0.273941
vs 0.3177628142 2.1045086057 0.15099 0.881423
am 2.5202268872 2.0566505535 1.22540 0.233990
gear 0.6554130171 1.4932599611 0.43891 0.665206
carb -0.1994192549 0.8287524977 -0.24063 0.812179
Residual standard error: 2.65019703 on 21 degrees of freedom
Multiple R-squared: 0.869015764, Adjusted R-squared: 0.806642319
F-statistic: 13.9324637 on 10 and 21 DF, p-value: 3.79315211e-07
Here, we can see that the model has explained almost 86.9 percent of variation present in mpg
, but non of the predictors are significant. This is a hint of having unnecessary variables that has increased model error. Using regsubsets
function from leaps
package, we can select a subset of predictors based on some criteria.
library(leaps)
best.subset <-
regsubsets(
x = mtcars[, -1], # predictor variables
y = mtcars[, 1], # response variable (mpg)
nbest = 1, # top 1 best model
nvmax = ncol(mtcars) - 1, # max. number of variable (all)
method = "exhaustive" # search all possible subset
)
bs.smry <- summary(best.subset)
We can combine following summary output with a plot created from additional estimates to get some insight. These estimates are also found in the summary object. The output show which variables are included with a star(*
).
pander::pander(bs.smry$outmat)
bs.est <- data.frame(
nvar = 1:(best.subset$nvmax - 1),
adj.r2 = round(bs.smry$adjr2, 3),
cp = round(bs.smry$cp, 3),
bic = round(bs.smry$bic, 3)
)
bs.est <- tidyr::gather(bs.est, "estimates", "value", -nvar)
cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|
1 ( 1 ) | * | |||||||||
2 ( 1 ) | * | * | ||||||||
3 ( 1 ) | * | * | * | |||||||
4 ( 1 ) | * | * | * | * | ||||||
5 ( 1 ) | * | * | * | * | * | |||||
6 ( 1 ) | * | * | * | * | * | * | ||||
7 ( 1 ) | * | * | * | * | * | * | * | |||
8 ( 1 ) | * | * | * | * | * | * | * | * | ||
9 ( 1 ) | * | * | * | * | * | * | * | * | * | |
10 ( 1 ) | * | * | * | * | * | * | * | * | * | * |
We can make a plot to visualise the properties of these individual models and select a model with specific number of predictor that can give minimum BIC, or minimum CP or maximum adjusted rsquared.
library(ggplot2)
library(dplyr)
bs.est.select <- bs.est %>%
group_by(estimates) %>%
filter(
(value == max(value) & estimates == "adj.r2") |
(value == min(value) & estimates != "adj.r2")
)
assessment.plt <- ggplot(bs.est, aes(nvar, value, color = estimates)) +
geom_point(shape = 21, fill = "lightgray") +
geom_line() +
facet_wrap(~estimates, scale = 'free_y') +
theme(legend.position = "top") +
labs(x = "Number of variables in the model",
y = "Value of Estimate") +
scale_x_continuous(breaks = seq(0, 10, 2)) +
geom_point(data = bs.est.select, fill = "red",
shape = 21) +
geom_text(aes(label = paste0("nvar:", nvar, '\n', "value:", value)),
data = bs.est.select,
size = 3, hjust = 0, vjust = c(1, -1, -1),
color = "black", family = "monospace")
From these plots, we see that with 5 variables we will obtain maximum adjusted coefficient of determination (\(R^2\)). Similarly, both BIC and Mallow CP will be minimum for models with only 3 predictor variables. With the help of table above, we can identify these variables. From the table, row corresponding to 3 variables, we see that the three predictors are wt
, qsec
and am
. To obtain maximum adjusted \(R^2\), disp
and hp
should be added to the previous 3 predictors.
This way, we can reduce a model to few variables optimising different assessment criteria. Let look at the fit of these reduced models:
model.3 <- lm(mpg ~ wt + qsec + am, data = mtcars)
model.5 <- update(model.3, . ~ . + disp + hp)
Summaries of the models
3 Variable Model
summary(model.3)
Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.481066996 -1.555525403 -0.725730867 1.411012209 4.660998270
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.617780515 6.959592985 1.38195 0.17791517
wt -3.916503725 0.711201635 -5.50688 0.0000069527
qsec 1.225885972 0.288669554 4.24668 0.00021617
am 2.935837192 1.410904515 2.08082 0.04671551
Residual standard error: 2.45884649 on 28 degrees of freedom
Multiple R-squared: 0.849663556, Adjusted R-squared: 0.83355608
F-statistic: 52.7496394 on 3 and 28 DF, p-value: 1.21044643e-11
5 Variable Model
summary(model.5)
Call:
lm(formula = mpg ~ wt + qsec + am + disp + hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.53986223 -1.73976158 -0.31956287 1.16760234 4.55336831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.3619039643 9.7407948514 1.47441 0.152378
wt -4.0843320552 1.1940997228 -3.42043 0.002075
qsec 1.0068968313 0.4754328665 2.11785 0.043908
am 3.4704533964 1.4857800860 2.33578 0.027488
disp 0.0112376493 0.0106033311 1.05982 0.298972
hp -0.0211705474 0.0145046917 -1.45957 0.156387
Residual standard error: 2.42929086 on 26 degrees of freedom
Multiple R-squared: 0.863737676, Adjusted R-squared: 0.837533383
F-statistic: 32.9616859 on 5 and 26 DF, p-value: 1.84371712e-10
From these output, it seems that although adjusted \(R^2\) has increased in later model, the additional variables are not significant. we can compare these two model with an ANOVA test which compares the residual variance between these two models. We can write the hypothesis as,
\(H_0:\) Model 1 and Model 2 are same vs \(H_1:\) Model 1 and Model 2 are different
where, Model 1 and Model 2 represents 3 variable and 5 variable model
anova(model.3, model.5)
Analysis of Variance Table
Model 1: mpg ~ wt + qsec + am
Model 2: mpg ~ wt + qsec + am + disp + hp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 169.2859295
2 26 153.4378065 2 15.84812304 1.34273 0.27864
The ANOVA result can not reject the hypothesis so claim that Model 1 and Model 2 are same. So, it is better to select the simpler model with 3 predictor variables.
Step-wise selection
When the number of increases, a exhaustive search of all possible subset is computationally intensive. This disadvantage can be overcome by using step-wise selection procedure. A step-wise variable selection procedure can be,
- Forward Selection Procedure
- In this procedure …
- Backward Elimination Procedure
- Here …