How ANOVA analyzes the variance
We often use Analysis of Variance (ANOVA) to check whether different cases lead to similar outcomes and, when they do not, whether the difference is significant. Some simple examples are:
- The effect of different diets on the growth of fish
- Comparing the height of three different species of a plant
- The type of flour used for baking bread
In some of these examples the data are collected by setting up an experiment, and in others they are collected through sampling. This article tries to explain how ANOVA analyzes the variance, and in which situations the differences are significant, using both simulated and real data.
Consider the following model with three groups (indexed by \(i\)) and \(n\) observations per group (indexed by \(j\)):
\[y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, 3 \text{ and } j = 1, 2, \ldots, n\]
Here, \(\tau_i\) is the effect corresponding to group \(i\) and \(\varepsilon_{ij} \sim \mathrm{N}(0, \sigma^2)\) is the error term, the usual assumption of a linear model. The simulation example below describes this in detail.
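What "analyzing the variance" means can be made concrete with the standard one-way decomposition of the total sum of squares. The notation \(\bar{y}_{i\cdot}\) (mean of group \(i\)), \(\bar{y}_{\cdot\cdot}\) (grand mean), \(SS_T\), \(SS_B\) and \(SS_W\) is introduced here only for illustration:

\[\underbrace{\sum_{i=1}^{3}\sum_{j=1}^{n}\left(y_{ij} - \bar{y}_{\cdot\cdot}\right)^2}_{SS_T} = \underbrace{n\sum_{i=1}^{3}\left(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}\right)^2}_{SS_B} + \underbrace{\sum_{i=1}^{3}\sum_{j=1}^{n}\left(y_{ij} - \bar{y}_{i\cdot}\right)^2}_{SS_W}\]

ANOVA compares the between-group and within-group pieces after scaling each by its degrees of freedom:

\[F = \frac{SS_B / (3 - 1)}{SS_W / (3n - 3)}\]

If all \(\tau_i\) are equal, \(F\) follows an \(F\) distribution with \(2\) and \(3n - 3\) degrees of freedom, so a large \(F\) suggests that the group means differ by more than the within-group noise alone would explain.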
Simulation and Real Data Example
Simulation
Consider the following three cases, in which the variance within each group is quite high while the three groups have different average values. Let us simulate only 100 observations per group:
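library(dplyr)  # provides tibble(), group_by(), mutate() and %>%
nsim <- 100     # number of observations per group

cases <- tibble(
  group = paste("Group", 1:3),
  mean = c(0.5, 1, 0.2),
  sd = rep(1, 3)
  # sd = c(0.2, 0.8, 0.3)
) %>%
  group_by(group) %>%
  # one list-column entry per group: nsim normal draws plus standard normal noise
  mutate(observation = purrr::map2(mean, sd, function(mean, sd) {
    rnorm(nsim, mean, sd) + rnorm(nsim, 0, 1)
  }))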
# A tibble: 3 × 4
# Groups:   group [3]
  group    mean    sd observation
  <chr>   <dbl> <dbl> <list>
1 Group 1   0.5     1 <dbl [100]>
2 Group 2   1       1 <dbl [100]>
3 Group 3   0.2     1 <dbl [100]>
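Each observation in the simulation code is the sum of two independent normal draws, so the standard deviation within a group is \(\sqrt{\mathrm{sd}^2 + 1}\) rather than `sd` itself. A quick check of this, as a minimal sketch in which the seed and the large sample size are arbitrary assumptions chosen only to make the check stable:

```r
# Assumptions: arbitrary seed and a large n, used only to stabilise the estimate
set.seed(1)
n <- 1e5

# Same construction as in the simulation: a draw with sd = 1 plus unit-variance noise
x <- rnorm(n, mean = 0.5, sd = 1) + rnorm(n, mean = 0, sd = 1)

# Variances of independent terms add, so the sd is sqrt(1^2 + 1^2)
sd_theoretical <- sqrt(1^2 + 1^2)
sd_empirical <- sd(x)

print(c(theoretical = sd_theoretical, empirical = sd_empirical))
```

This is why a residual standard error close to 1.4, rather than 1, is expected when a linear model is fitted to these data.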
Analysis
Plot
ANOVA
Call:
lm(formula = observation ~ group, data = tidyr::unnest(cases,
    cols = "observation"))

Residuals:
    Min      1Q  Median      3Q     Max
-3.7353 -0.9074 -0.0727  1.0054  4.4233

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.4132     0.1378   2.997  0.00295 **
groupGroup 2   0.5012     0.1949   2.571  0.01064 *
groupGroup 3  -0.3965     0.1949  -2.034  0.04286 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.378 on 297 degrees of freedom
Multiple R-squared:  0.06691,   Adjusted R-squared:  0.06063
F-statistic: 10.65 on 2 and 297 DF,  p-value: 3.415e-05
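The p-value in the last line of the summary is the upper tail of the F distribution with 2 and 297 degrees of freedom, evaluated at the observed statistic. A minimal sketch using the rounded values printed above, so the result only agrees with the summary to about two significant digits:

```r
# Values copied from the summary output (rounded, so the result is approximate)
f_observed <- 10.65
df_between <- 2    # number of groups - 1
df_within <- 297   # total observations - number of groups

# Probability of an F value at least this large if all group means were equal
p_value <- pf(f_observed, df_between, df_within, lower.tail = FALSE)

# Smallest F value that would be declared significant at the 5% level
f_critical <- qf(0.95, df_between, df_within)

print(p_value)
print(f_critical)
```

The observed statistic lies far beyond the 5% critical value, which is what the small p-value expresses.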
Effects
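One way to read the estimated effects is through the coefficient table above. The coefficient names show that R used its default treatment coding: the intercept (0.4132) is the estimated mean of Group 1, and the other coefficients are differences from it. The estimated mean of Group 2 is therefore 0.4132 + 0.5012 = 0.9144 and that of Group 3 is 0.4132 - 0.3965 = 0.0167. The sketch below checks this relationship on freshly simulated data; the seed and the data frame `sim` are assumptions, so the numbers will not match the output above.

```r
# Assumptions: arbitrary seed and a fresh draw, not the article's original data
set.seed(2024)
nsim <- 100
true_means <- c(0.5, 1, 0.2)

sim <- data.frame(
  group = factor(rep(paste("Group", 1:3), each = nsim)),
  # same construction as the article: normal draw plus unit-variance noise
  observation = rnorm(3 * nsim, mean = rep(true_means, each = nsim), sd = 1) +
    rnorm(3 * nsim, mean = 0, sd = 1)
)

fit <- lm(observation ~ group, data = sim)

# Plain group averages
group_means <- tapply(sim$observation, sim$group, mean)

# Intercept = mean of the first group; remaining coefficients = differences from it
means_from_fit <- c(coef(fit)[1], coef(fit)[1] + coef(fit)[-1])

print(rbind(group_means = as.vector(group_means),
            means_from_fit = unname(means_from_fit)))
```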
Post-hoc
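A significant F-test only says that at least one group mean differs; it does not say which. One common follow-up is Tukey's honest significant difference, available in base R as `TukeyHSD()`. The sketch below is an assumption about how such a comparison could be run, not necessarily the method used for this article, and it uses freshly simulated data with an arbitrary seed.

```r
# Assumptions: arbitrary seed and a fresh draw, not the article's original data
set.seed(2024)
nsim <- 100
true_means <- c(0.5, 1, 0.2)

sim <- data.frame(
  group = factor(rep(paste("Group", 1:3), each = nsim)),
  observation = rnorm(3 * nsim, mean = rep(true_means, each = nsim), sd = 1) +
    rnorm(3 * nsim, mean = 0, sd = 1)
)

# TukeyHSD() works on an aov fit of the same model
fit_aov <- aov(observation ~ group, data = sim)
tukey <- TukeyHSD(fit_aov, which = "group", conf.level = 0.95)

# One row per pair of groups: difference, confidence limits, adjusted p-value
print(tukey$group)
```

Each row compares one pair of groups; a pair whose interval excludes zero differs at the chosen family-wise confidence level.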
Real Data
Let's use the famous iris dataset and analyze how the Species differ in terms of Sepal.Length:
iris <- as_tibble(iris)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# … with 140 more rows
Analysis
Plot
ANOVA
Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max
-1.6880 -0.3285 -0.0060  0.3120  1.3120

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
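The F-statistic, R-squared and residual standard error in this summary can all be rebuilt from the between-group and within-group sums of squares. The following is a minimal sketch using the built-in `datasets::iris`; the variable names (`ss_between`, `ss_within`, and so on) are chosen here for illustration.

```r
y <- datasets::iris$Sepal.Length
g <- datasets::iris$Species

grand_mean <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Variation of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Variation of the observations around their own group mean
ss_within <- sum((y - group_means[as.character(g)])^2)
# Total variation; equals ss_between + ss_within
ss_total <- sum((y - grand_mean)^2)

k <- nlevels(g)        # number of groups
n_total <- length(y)   # total number of observations

f_stat <- (ss_between / (k - 1)) / (ss_within / (n_total - k))
r_squared <- ss_between / ss_total
residual_se <- sqrt(ss_within / (n_total - k))

print(round(c(F = f_stat, R2 = r_squared, sigma = residual_se), 4))
```

Here the between-group variation accounts for about 62% of the total, whereas in the simulated example it was under 7%. That difference is what makes the iris F-statistic so much larger.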