I must not fear ANOVA.
Fear is the mind-killer.
Fear is the little-death that brings total obliteration.
I will face my fear.
I will permit it to pass over me and through me.
And when it has gone past, I will turn the inner eye to see its path.
Where the ANOVA has gone, there will be nothing. Only least-squares regression will remain.
Whenever we employ linear modeling techniques, we make the extremely convenient assumption that we can express these models using high school algebra:
\[\begin{equation} \tag{first style} f(x) = ax + b \end{equation}\]
If this comes as a surprise to you, it may be because we often use a different notation for linear modeling. This is deeply confusing. The above, hopefully familiar model would be expressed like this:
\[\begin{equation} \tag{second style} y \sim \beta_0 + \beta_1 x \end{equation}\]
Don’t be fooled, dear reader, the meaning behind these formulas is identical. For our purposes, \(f(x)\) is just another way of saying \(y\). The tilde (\(\sim\)) should be read as ‘is explained by’, a more nuanced version of ‘is equal to’ (\(=\)). Finally, we have \(\beta_0\) and \(\beta_1\), which correspond to \(b\) and \(a\) in the first formula. Note that the order is swapped between the two formulas. The reason for this is convention: with the first notation, we use letters to mark the coefficients, usually ordered from highest to lowest power of \(x\). So:
\[\begin{equation} \tag{first style} f(x) = ax^3 + bx^2 + cx + d \end{equation}\]
In the second notation, we still use letters, but they’re all \(\beta\). To tell our \(\beta\)s apart we label them at their bottom-right, starting with \(\beta_0\). Further, we typically go from lowest to highest order. So:
\[\begin{equation} \tag{second style} y \sim \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 \end{equation}\]
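As an aside, this second style maps almost directly onto R’s formula syntax. The sketch below assumes we have numeric vectors y and x lying around; note that the intercept \(\beta_0\) is included implicitly:

lm(y ~ x)                    # y ~ beta_0 + beta_1 * x
lm(y ~ x + I(x^2) + I(x^3))  # y ~ beta_0 + beta_1*x + beta_2*x^2 + beta_3*x^3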
Let’s start by taking a look at our data. We have 20 observations of \(y\), with corresponding \(x\) values. In this particular case the \(x\)-values happen to be the integers from 1 to 20.
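If you’d like to follow along, a couple of lines of R along these lines would draw the scatterplot (a sketch; y here stands for the 20 observations described above, whose exact values aren’t reproduced in the text):

x <- 1:20    # the x-values: integers from 1 to 20
plot(x, y)   # scatterplot of the 20 observations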
At this point, your instincts are hopefully screaming at you about a positive association between \(y\) and \(x\).
In order to determine whether a model fits sufficiently well, perhaps even significantly well, we have to compare it to some sort of baseline model. By convention, we refer to the baseline model as the null-model. By constructing these two models, we have implicitly proposed two hypothetical scenarios, two hypotheses, conventionally referred to as \(H_0\) and \(H_1\): the null-hypothesis and the often more interesting non-null, ‘actual’ hypothesis.
In this example, we expect there is an association between \(y\) and \(x\). In other words, we hypothesize that the ‘actual’ model is some sort of diagonal, straight line. In the section above, we discussed that the formula for such a line is:
\[\begin{equation} \tag{$H_1$ model} y \sim \beta_0 + \beta_1x \end{equation}\]
Convince yourself that \(\beta_1x\), which corresponds to \(ax\) using the first notation style, is the only part of this model that can be affected by \(x\). Further, recognize that \(\beta_1\) (or \(a\)) controls the slope of the straight line that is defined by this model. Now that we’ve defined our \(H_1\), let’s think about the null-model and corresponding implicit null-hypothesis, \(H_0\).
If our \(H_1\) is that there is an association between \(y\) and \(x\), it seems fair that our \(H_0\) should be that there is no association between \(y\) and \(x\). In that case, we should expect that our model would perform just fine without any mention of \(x\). So, let’s simply kick out \(\beta_1x\). What’s left will be our \(H_0\).
\[\begin{equation} \tag{$H_0$ model} y \sim \beta_0 \end{equation}\]
Recall that \(\beta_0\), also known as \(b\) in the first notation style, is a ‘regular’, constant value that controls the \(y\)-values (height) of our line.
Before we go ahead and plot our models, we’ll note down the number of parameters used for each of our models. \(H_1\) uses two (namely \(\beta_0\) and \(\beta_1\)), whereas \(H_0\) only uses one (namely, \(\beta_0\)). We’ll need this information later. With all this in mind, let’s take a look at the two straight lines our models have drawn through our data.
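In R, fitting the two models and drawing their lines could look something like this sketch (assuming the x and y vectors from before; the object names fit_h0 and fit_h1 are just illustrative):

fit_h1 <- lm(y ~ x)  # H1: intercept and slope, two parameters
fit_h0 <- lm(y ~ 1)  # H0: intercept only, one parameter

plot(x, y)
abline(fit_h1)               # the sloped H1 line
abline(h = coef(fit_h0)[1])  # the flat H0 line, which sits at the mean of y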
Neither model perfectly explains the data, as not all data points lie precisely on the line. We can measure how much discrepancy there is between the points and the line. The discrepancies, or errors, that remain after fitting the model are called the residuals. Note that some of the residuals are negative, while others are positive. Because positive and negative residuals are equally bad, we’ll make them both positive. In the case of least-squares regression we square the values to achieve this, as the name implies.
Next, we want to summarize the total amount of error for both of our models. To achieve this, we’ll simply add (or sum) all the squared residual errors together for both models. You may have seen the term ‘sum-of-squares’ before. This is exactly what’s going on here. We’ll call the two resulting residual sums of squares \(RSS_0\) and \(RSS_1\), for \(H_0\) and \(H_1\), respectively.
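In code, that amounts to squaring and summing the residuals of each fitted model. A minimal sketch, assuming the fit_h0 and fit_h1 objects from the earlier sketch:

rss_0 <- sum(residuals(fit_h0)^2)  # residual sum of squares under H0
rss_1 <- sum(residuals(fit_h1)^2)  # residual sum of squares under H1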
Now that we have calculated the residual sums of squares (RSS) for both models, we can start thinking about which model we find more likely. Often, we are interested in estimating a p-value; in other words, an estimate of how likely we would be to observe data at least as extreme as ours, assuming that the null-hypothesis is true. We can use the F-distribution to calculate our p-value. We can think about the F-distribution as a set of shapes defined by a formula, a bit like how the formula \(f(x) = ax^2 + bx + c\) defines a set of shapes, namely parabolas. In order to figure out which particular shape and corresponding p-value we have, we need to know the value of three parameters, named \(F\), \(df_1\) and \(df_2\). R then provides a wonderfully convenient function, pf(), that does most of the heavy lifting for us. What’s more, we now have all the necessary information to find the values of these three parameters. The formulas for our three parameters are as follows:
\[\begin{equation} \tag{The F-statistic} F = \frac{RSS_0 - RSS_1}{RSS_1}\times \frac{df_2}{df_1}, \quad df_1 = p_1 - p_0, \quad df_2 = n - p_1 \end{equation}\]
Where \(p_0\) is the number of parameters in \(H_0\) (one), \(p_1\) is the number of parameters in \(H_1\) (two) and \(n\) is, well, the number of observations (20). Let’s plug that into our formula:
\[\begin{equation} \tag{Our F-value} F = \frac{35.922 - 16.192}{16.192}\times \frac{20 - 2}{2 - 1} = 21.93 \end{equation}\]
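The same arithmetic in R, assuming the rss_0 and rss_1 values computed in the sketch above:

n    <- 20  # number of observations
p_0  <- 1   # parameters in H0
p_1  <- 2   # parameters in H1
df_1 <- p_1 - p_0
df_2 <- n - p_1
f_value <- (rss_0 - rss_1) / rss_1 * (df_2 / df_1)  # comes out at roughly 21.93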
So, for our example, \(F = 21.93\), \(df_1 = 1\) and \(df_2 = 18\). Let’s plug that into pf() to get our p-value.
pf(21.93, 1, 18, lower.tail = FALSE)
## [1] 0.0001852457
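If you are curious what this particular member of the F family looks like, its density is available through df(); a quick sketch, using the degrees of freedom we just derived:

curve(df(x, df1 = 1, df2 = 18), from = 0, to = 30, ylab = "density")
abline(v = 21.93)  # our observed F-value sits far out in the right tail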
Let’s compare our values to those we get when fitting an ANOVA in R the regular way:
lm(y ~ x) |> anova()
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 19.731 19.7306 21.934 0.0001851 ***
## Residuals 18 16.192 0.8995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Great! Our values correspond to those in Df, F value, and Pr(>F), up to rounding error. We can also spot \(RSS_1\) in the Sum Sq column, but \(RSS_0\) is not found immediately. However, when we sum together the two values in the Sum Sq column, we do indeed get 35.922. Wonderful!
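As a final sanity check, R can also compare the two models head-to-head. Calling anova() with both fits (assuming the fit_h0 and fit_h1 objects from the earlier sketch) reports the residual sum of squares of each model directly, so both residual sums of squares should show up in its RSS column:

anova(fit_h0, fit_h1)  # the RSS column lists RSS_0 and RSS_1 side by side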