How ANOVA analyzes the variance

We often use Analysis of Variance (ANOVA) to find out whether different cases lead to similar outcomes and, if they differ, whether the difference is significant. Following are some simple examples,

  • The effect of different diets on the growth of fish
  • Comparing the height of three different species of a plant
  • The effect of the type of flour used for baking bread

These are common examples. In some of them the data are collected by setting up an experiment, and in others they are collected through sampling. This article tries to explain how ANOVA analyzes the variance, and in what situations the differences are significant, through both simulated and real data.

Consider the following model with three groups, indexed by \(i\), and \(n\) observations per group, indexed by \(j\),

\[y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \; i = 1, 2, 3 \text{ and } j = 1, 2, \ldots, n\]

Here, \(\mu\) is the overall mean, \(\tau_i\) is the effect corresponding to group \(i\), and \(\varepsilon_{ij} \sim \mathrm{N}(0, \sigma^2)\) is the error term, which is the usual assumption of the linear model. The simulation example below describes this in detail.
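
Under this model, ANOVA splits the total variation around the grand mean into a between-group part and a within-group part. As a sketch in the notation above (the symbols \(\bar{y}_{i\cdot}\) for the mean of group \(i\), \(\bar{y}_{\cdot\cdot}\) for the grand mean and \(k = 3\) for the number of groups are introduced here for illustration, and a balanced design with \(n\) observations per group is assumed),

\[\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij}-\bar{y}_{\cdot\cdot})^2}_{\mathrm{SS}_{\text{total}}} = \underbrace{n\sum_{i=1}^{k}(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2}_{\mathrm{SS}_{\text{between}}} + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij}-\bar{y}_{i\cdot})^2}_{\mathrm{SS}_{\text{within}}}\]

\[F = \frac{\mathrm{SS}_{\text{between}}/(k-1)}{\mathrm{SS}_{\text{within}}/(k(n-1))}\]

If all group effects are equal, \(F\) follows an F distribution with \(k-1\) and \(k(n-1)\) degrees of freedom, so a large \(F\) indicates that the groups differ by more than the within-group noise would explain. For three groups of 100 observations each, this gives 2 and 297 degrees of freedom.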

Simulation and Real Data Example

Simulation

Consider the following three groups, each with a fairly high variance. The groups have different mean values. Let us simulate only 100 observations per group,

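library(dplyr)  # provides tibble(), group_by(), mutate() and %>%

nsim <- 100  # number of observations per group

cases <- tibble(
  group = paste("Group", 1:3),
  mean = c(0.5, 1, 0.2),
  sd = rep(1, 3)
  # sd = c(0.2, 0.8, 0.3)
) %>%
  group_by(group) %>%
  mutate(observation = purrr::map2(mean, sd, function(mean, sd) {
    # group-specific normal draw plus an extra standard-normal noise term
    rnorm(nsim, mean, sd) + rnorm(nsim, 0, 1)
  }))
cases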
# A tibble: 3 x 4
# Groups:   group [3]
  group    mean    sd observation
  <chr>   <dbl> <dbl> <list>     
1 Group 1   0.5     1 <dbl [100]>
2 Group 2   1       1 <dbl [100]>
3 Group 3   0.2     1 <dbl [100]>

Analysis

Plot
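
The original figure is not reproduced here. As a rough sketch of how such a plot could be drawn, the snippet below uses base R graphics on a stand-in data frame `sim`, simulated the same way as `cases`. The name `sim`, the seed and the plot styling are assumptions made for this illustration.

```r
# Hypothetical stand-in for the simulated data: three groups of 100 draws,
# each the sum of a group-specific normal draw and standard-normal noise.
set.seed(2024)  # arbitrary seed, chosen only to make this sketch reproducible
nsim <- 100
sim <- data.frame(
  group = rep(paste("Group", 1:3), each = nsim),
  observation = rnorm(3 * nsim, mean = rep(c(0.5, 1, 0.2), each = nsim), sd = 1) +
    rnorm(3 * nsim, mean = 0, sd = 1)
)

# One box per group shows the within-group spread.
bp <- boxplot(observation ~ group, data = sim,
              xlab = "Group", ylab = "Observation",
              main = "Simulated observations by group")

# Overlay the group means (points) and the grand mean (dashed line).
group_means <- tapply(sim$observation, sim$group, mean)
points(seq_along(group_means), group_means, pch = 19, col = "red")
abline(h = mean(sim$observation), lty = 2)
```

The distance of each red point from the dashed line reflects the between-group variation, while the height of each box reflects the within-group variation that ANOVA compares it against.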

ANOVA

Call:
lm(formula = observation ~ group, data = tidyr::unnest(cases, 
    cols = "observation"))

Residuals:
         Min           1Q       Median           3Q          Max 
-3.735317851 -0.907376098 -0.072698422  1.005383398  4.423257346 

Coefficients:
                 Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   0.413162844  0.137848996  2.99721 0.0029547
groupGroup 2  0.501159241  0.194947920  2.57073 0.0106354
groupGroup 3 -0.396494086  0.194947920 -2.03385 0.0428557

Residual standard error: 1.37848996 on 297 degrees of freedom
Multiple R-squared:  0.0669128468,  Adjusted R-squared:  0.0606294317 
F-statistic: 10.6491207 on 2 and 297 DF,  p-value: 0.0000341545206
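
To see where an F-statistic like the one above comes from, the sums of squares can be computed by hand and compared with what `lm()` reports. This is a minimal sketch on freshly simulated data (the data frame `sim` and the seed are assumptions for this illustration), so its numbers will not match the output above.

```r
set.seed(1)  # arbitrary seed; results will differ from the output shown above
nsim <- 100
sim <- data.frame(
  group = rep(paste("Group", 1:3), each = nsim),
  observation = rnorm(3 * nsim, mean = rep(c(0.5, 1, 0.2), each = nsim), sd = 1) +
    rnorm(3 * nsim, mean = 0, sd = 1)
)

k <- length(unique(sim$group))  # number of groups
N <- nrow(sim)                  # total number of observations

grand_mean  <- mean(sim$observation)
group_means <- tapply(sim$observation, sim$group, mean)
group_sizes <- tapply(sim$observation, sim$group, length)

# Between-group: how far the group means lie from the grand mean.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group: how far observations lie from their own group mean.
ss_within <- sum((sim$observation - group_means[sim$group])^2)

# F is the ratio of the two mean squares.
f_manual <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_manual <- pf(f_manual, k - 1, N - k, lower.tail = FALSE)

# The same statistic as reported by the linear model.
fit  <- lm(observation ~ group, data = sim)
f_lm <- unname(summary(fit)$fstatistic["value"])

print(c(manual = f_manual, lm = f_lm, p_value = p_manual))
```

The hand-computed ratio and the value from `summary(fit)` agree, which is the sense in which ANOVA "analyzes the variance": it compares the variance explained by the groups with the variance left inside them.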

Effects

Post-hoc
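
A significant F-test only says that at least one group differs, not which ones. One common follow-up is Tukey's honest significant difference, available in base R as `TukeyHSD()`. The sketch below is one possible post-hoc analysis, not necessarily the one used for this article, and it runs on a stand-in data frame `sim` with an arbitrary seed.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
nsim <- 100
sim <- data.frame(
  group = factor(rep(paste("Group", 1:3), each = nsim)),
  observation = rnorm(3 * nsim, mean = rep(c(0.5, 1, 0.2), each = nsim), sd = 1) +
    rnorm(3 * nsim, mean = 0, sd = 1)
)

# TukeyHSD() works on an aov fit; the grouping variable must be a factor.
fit_aov <- aov(observation ~ group, data = sim)
tukey   <- TukeyHSD(fit_aov, "group", conf.level = 0.95)

# One row per pair of groups: estimated difference, confidence limits
# and a p-value adjusted for the three comparisons.
print(tukey)
```

A pair of groups is judged different when its confidence interval excludes zero, or equivalently when its adjusted p-value is below the chosen level.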

Real Data

Let's use the famous iris dataset and analyze how Sepal.Length differs between Species,

iris <- tibble::as_tibble(iris)
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

Analysis

Plot

ANOVA

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                      Estimate   Std. Error  t value   Pr(>|t|)
(Intercept)       5.0060000000 0.0728022202 68.76164 < 2.22e-16
Speciesversicolor 0.9300000000 0.1029578872  9.03282 8.7702e-16
Speciesvirginica  1.5820000000 0.1029578872 15.36551 < 2.22e-16

Residual standard error: 0.514789436 on 147 degrees of freedom
Multiple R-squared:  0.618705731,   Adjusted R-squared:  0.613518054 
F-statistic: 119.264502 on 2 and 147 DF,  p-value: < 2.220446e-16
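
The F-statistic and R-squared in this summary are two views of the same sums of squares. As a small sketch, the classical ANOVA table from base R's `anova()` makes the link explicit (the object names `tab`, `f_value` and `r_squared` are chosen here for illustration).

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

# Classical ANOVA table: one row for Species, one for the residuals.
tab <- anova(fit)
print(tab)

# The F-statistic is the ratio of the two mean squares.
f_value <- tab["Species", "Mean Sq"] / tab["Residuals", "Mean Sq"]

# R-squared is the share of the total sum of squares explained by Species.
r_squared <- tab["Species", "Sum Sq"] / sum(tab[["Sum Sq"]])

print(c(F = f_value, R2 = r_squared))
```

Both values reproduce the ones in the summary above: roughly 62% of the variation in sepal length lies between species, which is why the F-statistic is so large.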

Effects
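
The coefficients of the fitted model can be read as group effects. Assuming R's default treatment contrasts, the intercept is the mean of the reference species (setosa) and the other coefficients are differences from it. The sketch below checks this against the raw species means and also expresses the effects as deviations from the grand mean, which is closer to the \(\tau_i\) in the model (the names `reconstructed` and `tau_hat` are introduced here for illustration).

```r
fit <- lm(Sepal.Length ~ Species, data = iris)
coefs <- coef(fit)

# Raw mean sepal length per species.
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Group means rebuilt from the coefficients (treatment contrasts assumed).
reconstructed <- c(
  setosa     = unname(coefs["(Intercept)"]),
  versicolor = unname(coefs["(Intercept)"] + coefs["Speciesversicolor"]),
  virginica  = unname(coefs["(Intercept)"] + coefs["Speciesvirginica"])
)
print(rbind(species_means, reconstructed))

# Effects as deviations from the grand mean; they sum to zero here
# because every species has the same number of observations.
tau_hat <- species_means - mean(iris$Sepal.Length)
print(tau_hat)

# 95% confidence intervals for the intercept and the two differences.
print(confint(fit))
```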

Post-hoc
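
To find out which species differ from each other, one option is a set of pairwise t-tests with a multiplicity correction. The sketch below uses base R's `pairwise.t.test()` with the Holm adjustment and a pooled standard deviation. This is one possible post-hoc choice, not necessarily the one used for this article.

```r
# Pairwise comparisons of sepal length between species, using the pooled
# within-group standard deviation and Holm-adjusted p-values.
pw <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                      p.adjust.method = "holm", pool.sd = TRUE)
print(pw)

# The adjusted p-values are stored as a lower-triangular matrix.
p_adjusted <- pw$p.value[lower.tri(pw$p.value, diag = TRUE)]
print(p_adjusted < 0.05)
```

For iris, all three pairwise differences fall well below the 5% level, so each species differs from the other two in mean sepal length.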