How ANOVA analyzes the variance

Author

TheRimalaya

Published

March 29, 2021

Modified

October 8, 2024

Analysis of Variance (ANOVA) is a powerful statistical tool used to analyze the variance among group means and determine whether these differences are statistically significant. It’s commonly used in various fields such as agriculture, biology, psychology, and more to test hypotheses about different groups. Below are some practical examples:

The effect of different diets on the growth of fishes
Comparing the height of three different species of a plant
The type of flour used for baking bread

Data for ANOVA could be collected through designed experiments or by sampling from populations. This article helps explain how ANOVA analyzes variance and identifies situations when these differences are significant, using both simulated and real data.

We often use Analysis of Variance (ANOVA) to check whether different conditions lead to similar outcomes and, if they do not, whether the difference is significant.

Consider the following model with three groups, indexed by $i$, and $n$ observations per group, indexed by $j$:

$y_{ij} = \mu + \tau_{i} + \varepsilon_{ij}, \quad i = 1, 2, 3 \text{ and } j = 1, 2, \dots, n$

where $\tau_{i}$ represents the effect corresponding to group $i$ and $\varepsilon_{ij} \sim N(0, \sigma^{2})$ is the error term, the usual assumption of the linear model. To better understand how ANOVA finds the differences between groups, and how the group means and their standard deviations influence its results, we will explore the following four cases:

Simulation design

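# Packages assumed by the code in this article
library(data.table)
library(tidytable)
library(ggplot2)

Design <- tidytable::bind_rows(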
    Case1 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(10, 10, 10),
        SD = c(5, 5, 5)
    ),
    Case2 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(10, 10, 10),
        SD = c(1, 1, 1)
    ),
    Case3 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(5, 10, 15),
        SD = c(5, 5, 5)
    ),
    Case4 = data.table(
        Group = paste("Group", 1:3, sep = ""),
        Mean = c(5, 10, 15),
        SD = c(1, 1, 1)
    ), .id = "Cases")

Case 1: Similar group means with high variation within the groups
Case 2: Similar group means with low variation within the groups
Case 3: Distant group means with high variation within the groups
Case 4: Distant group means with low variation within the groups

Cases	Group1 Mean	Group1 SD	Group2 Mean	Group2 SD	Group3 Mean	Group3 SD
Case1	10	5	10	5	10	5
Case2	10	1	10	1	10	1
Case3	5	5	10	5	15	5
Case4	5	1	10	1	15	1

Fitting an ANOVA model for each case

Simulate and fit ANOVA

# Draw `nobs` normally distributed observations for one group
generate_data <- function(mean, sd, nobs = 50) {
    Response <- rnorm(nobs, mean = mean, sd = sd)
    tidytable(ID = 1:nobs, Response = Response)
}

# Simulate every group of every case, then fit one linear model per case
Model <- Design %>% 
    mutate(Data = map2(Mean, SD, generate_data)) %>% 
    unnest(Data) %>%
    nest(.by = "Cases") %>% 
    mutate(fit = map(data, function(dta) {
        lm(Response ~ Group, data = dta)
    }))

Model

# A tidytable: 4 × 3
  Cases data                  fit   
  <chr> <list>                <list>
1 Case1 <tidytable [150 × 5]> <lm>  
2 Case2 <tidytable [150 × 5]> <lm>  
3 Case3 <tidytable [150 × 5]> <lm>  
4 Case4 <tidytable [150 × 5]> <lm>
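
To make explicit what each fitted model is later asked to test, here is a minimal base R sketch that computes the F statistic by hand for one simulated dataset and compares it with the `lm()` route. The seed, the object names, and the choice of the Case 3 settings are assumptions made for illustration; they are not the values behind the results reported in this article.

```r
# Hand-computed one-way ANOVA on one simulated dataset (base R only).
set.seed(2021)  # assumption: arbitrary seed, used only for reproducibility

n_obs <- 50
group_means <- c(Group1 = 5, Group2 = 10, Group3 = 15)  # Case 3 means
group_sd <- 5                                            # Case 3 SD

sim <- data.frame(
  Group = factor(rep(names(group_means), each = n_obs)),
  Response = rnorm(
    n_obs * length(group_means),
    mean = rep(group_means, each = n_obs),
    sd = group_sd
  )
)

grand_mean <- mean(sim$Response)
means_by_group <- tapply(sim$Response, sim$Group, mean)

# Between-group sum of squares: group means around the grand mean
ss_between <- sum(n_obs * (means_by_group - grand_mean)^2)
# Within-group sum of squares: observations around their own group mean
ss_within <- sum((sim$Response - ave(sim$Response, sim$Group))^2)

df_between <- length(group_means) - 1
df_within <- nrow(sim) - length(group_means)

# F is the ratio of the between-group to the within-group mean square
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# The same statistic from the linear-model route used in this article
f_lm <- anova(lm(Response ~ Group, data = sim))[["F value"]][1]

c(manual = f_manual, lm = f_lm, p = p_manual)
```

The two F values should agree, since `anova()` on a one-factor `lm` fit performs the same decomposition.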

Distribution of data

Data distribution

Model[, map_df(fit, broom::augment), by = Cases] %>%
  ggplot(aes(Response, Group)) +
  geom_boxplot(
    aes(fill = Group, color = Group), 
    alpha = 0.25, width = 0.25
  ) +
  geom_point(aes(fill = Group),
    position = position_jitter(height = 0.1),
    shape = 21, size = 2, stroke = 0.25, alpha = 0.25
  ) +
  stat_summary(
    fun = mean, geom = "point", aes(color = Group),
    size = 2, shape = 21, fill = "whitesmoke", stroke = 0.75
  ) +
  facet_wrap(facets = vars(Cases), scales = "free_x") +
  ggridges::geom_density_ridges(
    aes(color = Group),
    fill = NA,
    panel_scaling = FALSE
  ) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  theme(
    legend.position = "none"
  )

Model comparison

ANOVA for the four cases

anova_result <- Model[, map_df(
  .x = fit,
  .f = ~ broom::tidy(anova(.x))
), by = Cases] %>% rename(
  DF = df,
  SSE = sumsq,
  MSE = meansq,
  Statistic = statistic,
  `p value` = p.value
)

anova_result %>% 
  mutate(
    Cases = case_when(
        Cases == "Case1" ~ "Case 1: Similar group means with high variation within the groups",
        Cases == "Case2" ~ "Case 2: Similar group means with low variation within the groups",
        Cases == "Case3" ~ "Case 3: Distant group means with high variation within the groups",
        TRUE ~ "Case 4: Distant group means with low variation within the groups"
    )
  ) %>% 
  gt::gt(groupname_col = "Cases", rowname_col = "term") %>%
  gt::fmt_number(columns = 4:6) %>%
  gt::fmt_number(columns = 7, decimals = 4) %>% 
  gt::sub_missing(missing_text = "") %>% 
  gt::tab_options(
    table.width = "100%",
    row_group.font.weight = "600"
  )

	DF	SSE	MSE	Statistic	p value
Case 1: Similar group means with high variation within the groups
Group	2	32.09	16.05	0.64	0.5304
Residuals	147	3,704.26	25.20
Case 2: Similar group means with low variation within the groups
Group	2	1.98	0.99	1.00	0.3714
Residuals	147	146.03	0.99
Case 3: Distant group means with high variation within the groups
Group	2	1,951.86	975.93	50.15	0.0000
Residuals	147	2,860.66	19.46
Case 4: Distant group means with low variation within the groups
Group	2	2,516.51	1,258.26	1,263.82	0.0000
Residuals	147	146.35	1.00
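
Each row of the table can be checked by hand: a mean square is the sum of squares divided by its degrees of freedom, and the statistic is the group mean square divided by the residual mean square. The sketch below redoes that arithmetic from the sums of squares printed above; the data frame and column names are illustrative, and small discrepancies with the table are expected because the inputs are rounded to two decimals.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
reported <- data.frame(
  Case = paste("Case", 1:4),
  SS_group = c(32.09, 1.98, 1951.86, 2516.51),
  SS_resid = c(3704.26, 146.03, 2860.66, 146.35),
  DF_group = 2,
  DF_resid = 147
)

# Mean square = sum of squares / degrees of freedom
reported$MS_group <- reported$SS_group / reported$DF_group
reported$MS_resid <- reported$SS_resid / reported$DF_resid

# F statistic and its upper-tail p-value under the F(2, 147) distribution
reported$F_stat <- reported$MS_group / reported$MS_resid
reported$p_value <- pf(
  reported$F_stat, reported$DF_group, reported$DF_resid,
  lower.tail = FALSE
)

print(reported[, c("Case", "MS_group", "MS_resid", "F_stat", "p_value")], digits = 4)
```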

Interpretation

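Case 1: The p-value is high (0.53), so there is no evidence of a difference between the groups. The group means are identical by design, so the group mean square (16.05) is no larger than the residual mean square (25.20).

Case 2: The p-value is still high (0.37), again indicating no significant difference. Low within-group variation does not produce a significant result on its own: with equal group means, the group and residual mean squares are both about 0.99, so the F statistic is close to 1.

Case 3: Despite the high variation within groups, the distant group means give a large F statistic (50.15) and a p-value below 0.0001, indicating statistically significant differences among the groups.

Case 4: With low within-group variation and distant means, the F statistic is far larger (1,263.82) and the p-value is again below 0.0001, strongly indicating significant group differences.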
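
The pattern across the four cases follows from the expected mean squares of the one-way model. As a worked sketch, assuming the usual constraint $\sum_{i} \tau_{i} = 0$ (not stated in this article) and $n = 50$ observations per group as in the simulation:

$$E[MS_{\text{Residuals}}] = \sigma^{2}, \qquad E[MS_{\text{Group}}] = \sigma^{2} + \frac{n \sum_{i=1}^{3} \tau_{i}^{2}}{3 - 1}$$

In Cases 1 and 2 every $\tau_{i} = 0$, so both mean squares estimate $\sigma^{2}$ and the F ratio stays near 1 whatever the within-group SD. In Cases 3 and 4 the effects are $(-5, 0, 5)$, so the second term is $50 \times 50 / 2 = 1250$. The expected group mean square is then $25 + 1250 = 1275$ against $\sigma^{2} = 25$ in Case 3, and $1 + 1250 = 1251$ against $\sigma^{2} = 1$ in Case 4, which is why the F statistic grows so sharply once the within-group variation shrinks.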

In conclusion, ANOVA helps determine if there are significant differences between multiple group means by comparing variances within groups to variances between groups. The power of ANOVA lies in its ability to detect even subtle differences when variations are minimal within groups.
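
The closing point about subtle differences can be illustrated with `stats::power.anova.test()`. The sketch below is only an illustration: the group means `c(10, 10.5, 11)` are hypothetical and do not come from the simulation design above, while the group size and the two within-group SDs match the ones used in this article.

```r
# Sketch: power of the one-way ANOVA F test for a small, hypothetical
# difference in means under the two within-group SDs used in this article.
subtle_means <- c(10, 10.5, 11)  # assumption: illustrative means only
n_per_group <- 50                # same group size as the simulation

power_for_sd <- function(within_sd) {
  stats::power.anova.test(
    groups = length(subtle_means),
    n = n_per_group,
    between.var = var(subtle_means),  # variance of the group means
    within.var = within_sd^2,         # error variance within each group
    sig.level = 0.05
  )$power
}

power_high_noise <- power_for_sd(5)  # SD as in Cases 1 and 3
power_low_noise <- power_for_sd(1)   # SD as in Cases 2 and 4

c(sd_5 = power_high_noise, sd_1 = power_low_noise)
```

With the same half-unit steps between means, the test should be far more likely to reject when the within-group SD is 1 than when it is 5.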