# How ANOVA analyzes the variance

We often use Analysis of Variance (ANOVA) to find out whether different cases result in similar outcomes and, if they differ, whether the difference is significant. Following are some simple examples:

• The effect of different diets on the growth of fish
• Comparing the height of three different species of a plant
• The type of flour used for baking bread

These are some common examples: in some cases the data are collected by setting up an experiment, and in others they are collected through sampling. This article tries to explain how ANOVA analyzes the variance, and in what situations the differences are significant, through both simulated and real data.

Consider the following model with three groups (indexed by $$i$$) and $$n$$ observations per group (indexed by $$j$$),

$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \; i = 1, 2, 3 \text{ and } j = 1, 2, \ldots, n$

Here, $$\mu$$ is the overall mean, $$\tau_i$$ is the effect corresponding to group $$i$$, and $$\varepsilon_{ij} \sim \mathrm{N}(0, \sigma^2)$$ is the error term, the usual assumption of a linear model. The simulation example below describes it in detail.
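Under this model, the analysis rests on splitting the total variation into a between-group part and a within-group part. As a sketch in the notation above (the symbols $$\bar{y}_{i\cdot}$$ for the mean of group $$i$$, $$\bar{y}_{\cdot\cdot}$$ for the grand mean, and $$k$$ for the number of groups are introduced here for illustration), the decomposition reads

$$
\sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{\cdot\cdot})^2
= n \sum_{i=1}^{k} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2
+ \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{i\cdot})^2
$$

and the test statistic compares the two parts after scaling each by its degrees of freedom,

$$
F = \frac{n \sum_{i=1}^{k} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 \,/\, (k - 1)}
         {\sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{i\cdot})^2 \,/\, \bigl(k(n - 1)\bigr)}
$$

If all $$\tau_i$$ are equal, $$F$$ follows an F-distribution with $$k - 1$$ and $$k(n - 1)$$ degrees of freedom. A large value means the group means differ by more than the within-group noise alone would explain.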

## Simulation and Real Data Example

### Simulation

Consider the following three groups, where the variance within each group is quite high. The three groups have different average values. Let us simulate only 100 observations per group:

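library(tidyverse)  # provides tibble, dplyr and purrr

nsim <- 100  # number of observations simulated per group

cases <- tibble(
  group = paste("Group", 1:3),
  mean = c(0.5, 1, 0.2),
  sd = rep(1, 3)
  # sd = c(0.2, 0.8, 0.3)
) %>%
  group_by(group) %>%
  mutate(observation = purrr::map2(mean, sd, function(mean, sd) {
    rnorm(nsim, mean, sd) + rnorm(nsim, 0, 1)  # group draw plus extra N(0, 1) noise
  }))
cases  # print the nested tibble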
# A tibble: 3 × 4
# Groups:   group [3]
  group    mean    sd observation
  <chr>   <dbl> <dbl> <list>
1 Group 1   0.5     1 <dbl [100]>
2 Group 2   1       1 <dbl [100]>
3 Group 3   0.2     1 <dbl [100]>

#### Analysis

##### Plot

##### ANOVA

Call:
lm(formula = observation ~ group, data = tidyr::unnest(cases,
cols = "observation"))

Residuals:
Min      1Q  Median      3Q     Max
-3.7353 -0.9074 -0.0727  1.0054  4.4233

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.4132     0.1378   2.997  0.00295 **
groupGroup 2   0.5012     0.1949   2.571  0.01064 *
groupGroup 3  -0.3965     0.1949  -2.034  0.04286 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.378 on 297 degrees of freedom
Multiple R-squared:  0.06691,   Adjusted R-squared:  0.06063
F-statistic: 10.65 on 2 and 297 DF,  p-value: 3.415e-05
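
The F-statistic reported in the summary can be reproduced by hand from the between-group and within-group sums of squares. The following is a minimal sketch in base R on freshly simulated data. The seed and the names `sim`, `ss_between`, and `ss_within` are illustrative assumptions, and the sketch uses a single noise term with standard deviation 1, so its numbers will not match the output above.

```r
set.seed(2024)  # arbitrary seed, used only to make this sketch reproducible

nsim <- 100
group_means <- c("Group 1" = 0.5, "Group 2" = 1, "Group 3" = 0.2)

# Long-format data: one row per observation
sim <- data.frame(
  group = factor(rep(names(group_means), each = nsim)),
  observation = rnorm(3 * nsim, mean = rep(group_means, each = nsim), sd = 1)
)

grand_mean <- mean(sim$observation)
means_by_group <- tapply(sim$observation, sim$group, mean)

# Between-group variation: group means around the grand mean
ss_between <- sum(nsim * (means_by_group - grand_mean)^2)
# Within-group variation: observations around their own group mean
ss_within <- sum((sim$observation - means_by_group[as.character(sim$group)])^2)

df_between <- nlevels(sim$group) - 1
df_within <- nrow(sim) - nlevels(sim$group)

# Ratio of the two mean squares and its upper-tail probability
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# The same quantity from the built-in ANOVA table
fit <- lm(observation ~ group, data = sim)
f_builtin <- anova(fit)[["F value"]][1]

c(manual = f_manual, builtin = f_builtin)
```

The two values should agree up to floating-point error, which shows that the F-test in `summary(lm(...))` is the ratio of the between-group mean square to the within-group mean square.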
##### Effects

##### Post-hoc

### Real Data

Let's use the famous iris dataset and try to analyze how the Species differ in Sepal.Length:

iris <- as_tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl>       <dbl>        <dbl>       <dbl> <fct>
1          5.1         3.5          1.4         0.2 setosa
2          4.9         3            1.4         0.2 setosa
3          4.7         3.2          1.3         0.2 setosa
4          4.6         3.1          1.5         0.2 setosa
5          5           3.6          1.4         0.2 setosa
6          5.4         3.9          1.7         0.4 setosa
7          4.6         3.4          1.4         0.3 setosa
8          5           3.4          1.5         0.2 setosa
9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# … with 140 more rows

#### Analysis

##### Plot

##### ANOVA

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
Min      1Q  Median      3Q     Max
-1.6880 -0.3285 -0.0060  0.3120  1.3120

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
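
The coefficients in this summary use treatment contrasts, R's default for unordered factors: the intercept is the mean of the reference level (setosa) and the other coefficients are differences from it. A short sketch in base R that checks this against the raw group means follows. The object names `fit`, `group_means`, and `from_coef` are illustrative.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

# Raw mean of Sepal.Length within each species
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Rebuild the same means from the coefficients:
# intercept = setosa mean, other terms = offsets from setosa
b <- coef(fit)
from_coef <- c(
  setosa = unname(b["(Intercept)"]),
  versicolor = unname(b["(Intercept)"] + b["Speciesversicolor"]),
  virginica = unname(b["(Intercept)"] + b["Speciesvirginica"])
)

round(rbind(group_means, from_coef), 3)

# One-way ANOVA table: between-species and residual sums of squares
anova(fit)
```

Reading the summary above this way, the estimated mean sepal length is 5.006 for setosa, 5.006 + 0.930 = 5.936 for versicolor, and 5.006 + 1.582 = 6.588 for virginica.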
##### Effects

##### Post-hoc
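
A significant F-test only says that at least one species differs; it does not say which pairs differ. One common follow-up is Tukey's honest significant difference test, which compares every pair of groups while controlling the family-wise error rate. This is only a sketch using base R's `TukeyHSD()`; that the original analysis used this particular method is an assumption, and the names `fit_aov`, `posthoc`, and `pairwise` are illustrative.

```r
# aov() fits the same one-way model as lm(), in the form TukeyHSD() expects
fit_aov <- aov(Sepal.Length ~ Species, data = iris)

# Pairwise differences in mean Sepal.Length with 95% family-wise intervals
posthoc <- TukeyHSD(fit_aov, which = "Species", conf.level = 0.95)
posthoc

# Pairs whose adjusted p-value falls below 0.05
pairwise <- posthoc$Species
rownames(pairwise)[pairwise[, "p adj"] < 0.05]
```

Each row of `pairwise` holds the estimated difference (`diff`), the lower and upper interval bounds (`lwr`, `upr`), and the adjusted p-value (`p adj`) for one pair of species.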