By James Marquez, April 5, 2017
ANOVA is an ANalysis Of VAriance. It is a test to determine if there is a significant difference between the means of two or more populations. It describes the variance within groups and the variance between groups. It tests the null hypothesis which states that all population means are equal while the alternative hypothesis states that at least one is different. One-way ANOVA is used to test groups with only one response variable.
For this example, we're going to use a very popular dataset that is built into R and is used in a lot of machine learning examples. It's called the iris dataset and is a collection of flower samples each labeled with its flower species. There are four explanatory variables that describe each Species, which are Sepal Length, Sepal Width, Petal Length, and Petal Width. We're going to use this dataset as a toy example and use Sepal Width as our response variable (dependent variable) to test if there is a significant difference between the mean sepal widths of each species (independent variable) of iris flower.
The first step is to load the custom packages we will use to increase the functionality of base R.
library(ggplot2) # Functions used to create beautful plots library(mosaic) # Plot TukeyHSD
Next, let's load the iris dataset and explore it a few different ways to get a good understanding of the data we have available to us.
data('iris') # Load the iris dataset into local memory str(iris) # View the structure of the iris dataset table(iris$Species) # View the number of samples in each category of Species head(iris) # View the first six rows of the iris dataset
'data.frame': 150 obs. of 5 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
setosa versicolor virginica 50 50 50
We can see that our dataset has 150 observations or samples. It has 5 variables, four continuous variables that explain each Species type. The Species variable consists of 3 factor levels (categories); setosa, versicolor, and virginica consisting of 50 samples each.
In our walk-through, we're going to look at the variances of only one variable, Sepal Width, which is called the response variable. We are going to run an ANOVA test on this variable to see if there are significant differences between the means of each flower type's sepal widths or if the differences may be caused by random chance.
Let's first look at a density plot of each species' Sepal Width to verify that each is approximately a normally distributed population, which is a requirement to perform an ANOVA test. We'll use ggplot2 to create our plot.
# Group, color, and fill by Species ggplot(iris, aes(x=Sepal.Width)) + geom_density(aes(group=Species, color=Species, fill=Species), alpha=0.3)
We can see that all three have approximately normally distributed populations so we can continue with our one-way ANOVA test.
Next, let's perform our ANOVA test. We'll use the aov() function and pass in our variables in the correct order. We'll save our results to an object we name ANOVA.
ANOVA <- aov(Sepal.Width ~ Species, data=iris) # (DV ~ IV, data=dataset) summary(ANOVA) # View results of the ANOVA test
Df Sum Sq Mean Sq F value Pr(>F) Species 2 11.35 5.672 49.16 <2e-16 *** Residuals 147 16.96 0.115 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can see that the p-value is less than our alpha 0.05 significance level we've chosen, which means we reject the null hypothesis that the differences between the means are not statistically significant and instead accept the alternative hypothesis that the differences between at least one of the means is statistically significant. Our extremely low p-value means that there is a 0.00000000000002% chance we are wrong in our decision to reject the null hypothesis. Next, we'll run post-hoc tests, such as Tukey's Honest Significant Difference test, to determine which species have significantly different means.
Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = Sepal.Width ~ Species, data = iris) $Species diff lwr upr p adj versicolor-setosa -0.658 -0.81885528 -0.4971447 0.0000000 virginica-setosa -0.454 -0.61485528 -0.2931447 0.0000000 virginica-versicolor 0.204 0.04314472 0.3648553 0.0087802
Using an alpha of 0.05, we can see that the p adj value is less than our alpha in all three pairwise comparisons meaning there is a significant difference between all three species' means. We can use the mosaic package to visualize the results of TukeyHSD. We pass the TukeyHSD() function into the mplot() function.
mplot(TukeyHSD(ANOVA), system="ggplot") # Use the mosaic package to plot the results of TukeyHSD
If you prefer to do t-tests, you can use the following method to perform pairwise t-tests on all your factor levels. The pairwise.t.test() function allows you to choose between eight p-value adjustments to help counteract the problem of multiple comparisons: holm, hochberg, hommel, bonferroni, BH, BY, fdr, and none. To reduce the chance of incorrectly rejecting our null hypothesis (Type I error) we'll use the Bonferroni correction method when performing our multiple comparisons.
pairwise.t.test(iris$Sepal.Width, iris$Species, p.adj="bonferroni", paired=FALSE)
Pairwise comparisons using t tests with pooled SD data: iris$Sepal.Width and iris$Species setosa versicolor versicolor < 2e-16 - virginica 1.4e-09 0.0094 P value adjustment method: bonferroni
Our t-test comparisons show that all three species' means are significantly different because all p-values are less than our 0.05 alpha.