Model assessment and variable selection

Whenever we add a variable in our model, we are not only adding information but also noise that clutter the information and makes the analysis difficult. Simpler model is always better since they contain less noise and they are easy to interpret. In real life, things are not that simple. But relax, there are statistical methods that make model assessment and perform variable selection and gives optimal set of variables for us.

Here, I will discuss how to compare model to know which one is better than other. This model assessment help us to select a model with few variable that can perform as good as a model with large number of variables. Here comes the idea of variable selection. I will try to explain these concepts with an example. For the example I will use mtcars data set which is available in R by default. Following are two well know procedure of selecting subset of variables,

Best subset method

Best subset procedure selects best regression model by running all possible subset of variables. When number of variable is large, the possible combinations of candidate subset become huge. For example, for 2 variable case (\(X_1\) and \(X_2\)), there are 4 possible subsets – 1 with no predictors, 2 with single predictors and 1 with both predictors. Similarly, if there are 4 variable case (\(X_1, X_2, X_3\) and \(X_4\)), there will be 16 possible subset. In general, fitting all possible subset of large number of predictors becomes computationally intensive.

Once, all the possible model is fitted, they are compared based on some criteria. This model assessment may based on various criteria such as coefficient of determination (\(R^2\)), adjusted \(R^2\), Mallow’s CP, AIC and BIC. In R, leaps package can be used for performing this operation. Lets dig into our mtcars example.

Fitting a complete model:

full.model <- lm(mpg ~ ., data = mtcars)
smry <- summary(full.model)
smry

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
         Min           1Q       Median           3Q          Max 
-3.450644085 -1.604401795 -0.119605107  1.219267814  4.627094192 

Coefficients:
                 Estimate    Std. Error  t value Pr(>|t|)
(Intercept) 12.3033741560 18.7178844287  0.65731 0.518124
cyl         -0.1114404779  1.0450233625 -0.10664 0.916087
disp         0.0133352399  0.0178575003  0.74676 0.463489
hp          -0.0214821190  0.0217685793 -0.98684 0.334955
drat         0.7871109722  1.6353730686  0.48130 0.635278
wt          -3.7153039283  1.8944142995 -1.96119 0.063252
qsec         0.8210407497  0.7308447960  1.12341 0.273941
vs           0.3177628142  2.1045086057  0.15099 0.881423
am           2.5202268872  2.0566505535  1.22540 0.233990
gear         0.6554130171  1.4932599611  0.43891 0.665206
carb        -0.1994192549  0.8287524977 -0.24063 0.812179

Residual standard error: 2.65019703 on 21 degrees of freedom
Multiple R-squared:  0.869015764,   Adjusted R-squared:  0.806642319 
F-statistic: 13.9324637 on 10 and 21 DF,  p-value: 3.79315211e-07

Here, we can see that the model has explained almost 86.9 percent of variation present in mpg, but non of the predictors are significant. This is a hint of having unnecessary variables that has increased model error. Using regsubsets function from leaps package, we can select a subset of predictors based on some criteria.

library(leaps)
best.subset <-
  regsubsets(
    x      = mtcars[, -1],     # predictor variables
    y      = mtcars[, 1],      # response variable (mpg)
    nbest  = 1,                # top 1 best model
    nvmax  = ncol(mtcars) - 1, # max. number of variable (all)
    method = "exhaustive"      # search all possible subset
  )
bs.smry <- summary(best.subset)

We can combine following summary output with a plot created from additional estimates to get some insight. These estimates are also found in the summary object. The output show which variables are included with a star(*).

pander::pander(bs.smry$outmat)
bs.est <- data.frame(
  nvar   = 1:(best.subset$nvmax - 1),
  adj.r2 = round(bs.smry$adjr2, 3),
  cp     = round(bs.smry$cp, 3),
  bic    = round(bs.smry$bic, 3)
)
bs.est <- tidyr::gather(bs.est, "estimates", "value", -nvar)
  cyl disp hp drat wt qsec vs am gear carb
1 ( 1 ) *
2 ( 1 ) * *
3 ( 1 ) * * *
4 ( 1 ) * * * *
5 ( 1 ) * * * * *
6 ( 1 ) * * * * * *
7 ( 1 ) * * * * * * *
8 ( 1 ) * * * * * * * *
9 ( 1 ) * * * * * * * * *
10 ( 1 ) * * * * * * * * * *
We can make a plot to visualise the properties of these individual models and select a model with specific number of predictor that can give minimum BIC, or minimum CP or maximum adjusted rsquared.
library(ggplot2)
library(dplyr)
bs.est.select <- bs.est %>%
  group_by(estimates) %>%
  filter(
    (value == max(value) & estimates == "adj.r2") |
    (value == min(value) & estimates != "adj.r2")
  )
assessment.plt <- ggplot(bs.est, aes(nvar, value, color = estimates)) +
  geom_point(shape = 21, fill = "lightgray") +
  geom_line() +
  facet_wrap(~estimates, scale = 'free_y') +
  theme(legend.position = "top") +
  labs(x = "Number of variables in the model",
       y = "Value of Estimate") +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  geom_point(data = bs.est.select, fill = "red",
             shape = 21) +
  geom_text(aes(label = paste0("nvar:", nvar, '\n', "value:", value)),
            data = bs.est.select, 
            size = 3, hjust = 0, vjust = c(1, -1, -1),
            color = "black", family = "monospace")

From these plots, we see that with 5 variables we will obtain maximum adjusted coefficient of determination (\(R^2\)). Similarly, both BIC and Mallow CP will be minimum for models with only 3 predictor variables. With the help of table above, we can identify these variables. From the table, row corresponding to 3 variables, we see that the three predictors are wt, qsec and am. To obtain maximum adjusted \(R^2\), disp and hp should be added to the previous 3 predictors.

This way, we can reduce a model to few variables optimising different assessment criteria. Let look at the fit of these reduced models:

model.3 <- lm(mpg ~ wt + qsec + am, data = mtcars)
model.5 <- update(model.3, . ~ . + disp + hp)

Summaries of the models

3 Variable Model

summary(model.3)

Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)

Residuals:
         Min           1Q       Median           3Q          Max 
-3.481066996 -1.555525403 -0.725730867  1.411012209  4.660998270 

Coefficients:
                Estimate   Std. Error  t value     Pr(>|t|)
(Intercept)  9.617780515  6.959592985  1.38195   0.17791517
wt          -3.916503725  0.711201635 -5.50688 0.0000069527
qsec         1.225885972  0.288669554  4.24668   0.00021617
am           2.935837192  1.410904515  2.08082   0.04671551

Residual standard error: 2.45884649 on 28 degrees of freedom
Multiple R-squared:  0.849663556,   Adjusted R-squared:  0.83355608 
F-statistic: 52.7496394 on 3 and 28 DF,  p-value: 1.21044643e-11

5 Variable Model

summary(model.5)

Call:
lm(formula = mpg ~ wt + qsec + am + disp + hp, data = mtcars)

Residuals:
        Min          1Q      Median          3Q         Max 
-3.53986223 -1.73976158 -0.31956287  1.16760234  4.55336831 

Coefficients:
                 Estimate    Std. Error  t value Pr(>|t|)
(Intercept) 14.3619039643  9.7407948514  1.47441 0.152378
wt          -4.0843320552  1.1940997228 -3.42043 0.002075
qsec         1.0068968313  0.4754328665  2.11785 0.043908
am           3.4704533964  1.4857800860  2.33578 0.027488
disp         0.0112376493  0.0106033311  1.05982 0.298972
hp          -0.0211705474  0.0145046917 -1.45957 0.156387

Residual standard error: 2.42929086 on 26 degrees of freedom
Multiple R-squared:  0.863737676,   Adjusted R-squared:  0.837533383 
F-statistic: 32.9616859 on 5 and 26 DF,  p-value: 1.84371712e-10

From these output, it seems that although adjusted \(R^2\) has increased in later model, the additional variables are not significant. we can compare these two model with an ANOVA test which compares the residual variance between these two models. We can write the hypothesis as,

\(H_0:\) Model 1 and Model 2 are same vs \(H_1:\) Model 1 and Model 2 are different

where, Model 1 and Model 2 represents 3 variable and 5 variable model

anova(model.3, model.5)
Analysis of Variance Table

Model 1: mpg ~ wt + qsec + am
Model 2: mpg ~ wt + qsec + am + disp + hp
  Res.Df         RSS Df   Sum of Sq       F  Pr(>F)
1     28 169.2859295                               
2     26 153.4378065  2 15.84812304 1.34273 0.27864

The ANOVA result can not reject the hypothesis so claim that Model 1 and Model 2 are same. So, it is better to select the simpler model with 3 predictor variables.

Step-wise selection

When the number of increases, a exhaustive search of all possible subset is computationally intensive. This disadvantage can be overcome by using step-wise selection procedure. A step-wise variable selection procedure can be,

Forward Selection Procedure
In this procedure …
Backward Elimination Procedure
Here …