Linear models in practice
Posted on March 23, 2016
The mouse diet example
We will demonstrate how to analyze the high fat diet data using linear models instead of directly applying a t-test, and show that the two approaches are ultimately equivalent.
We start by reading in the data and creating a quick stripchart:
url <- "http://databeauty.com/data/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv"
library(downloader)
if (!file.exists(filename)) download(url, filename)
set.seed(1) # same jitter in the stripchart each time
dat <- read.csv(filename) # previously downloaded above
stripchart(dat$Bodyweight ~ dat$Diet, vertical=TRUE, method="jitter",
main="Bodyweight over Diet")
We can see that the high fat diet group appears to have higher weights on average, although there is overlap between the two samples.
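To quantify that impression, we can compute the group averages directly; a quick check using base R's tapply:
tapply(dat$Bodyweight, dat$Diet, mean)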
For demonstration purposes, we will build the design matrix $\mathbf{X}$ using the formula ~ Diet. The group with the 1’s in the second column is determined by the level of Diet that comes second; that is, the non-reference level.
levels(dat$Diet)
X <- model.matrix(~ Diet, data=dat)
head(X)
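If we wanted hf to be the reference level instead, we could re-order the factor levels with relevel; a minimal sketch (the factor call is a guard for newer R versions, where read.csv no longer converts strings to factors by default):
dat2 <- dat
dat2$Diet <- relevel(factor(dat2$Diet), ref="hf") # make "hf" the reference level
head(model.matrix(~ Diet, data=dat2)) # the indicator column now marks "chow"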
The Mathematics Behind lm()
Before we use our shortcut for running linear models, lm, we want to review what will happen internally. Inside of lm, the design matrix $\mathbf{X}$ is formed and the $\boldsymbol{\beta}$ minimizing the sum of squares is calculated using the previously described formula. That solution is:

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} $$
We can calculate this in R using the matrix multiplication operator %*%, the inverse function solve, and the transpose function t.
Y <- dat$Bodyweight
X <- model.matrix(~ Diet, data=dat)
solve(t(X) %*% X) %*% t(X) %*% Y
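As a side note, an equivalent way to write the same computation is to solve the normal equations $(\mathbf{X}^\top \mathbf{X}) \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{Y}$ directly, which avoids explicitly inverting $\mathbf{X}^\top \mathbf{X}$:
solve(crossprod(X), crossprod(X, Y)) # crossprod(X) is t(X) %*% X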
These coefficients are the average of the control group and the difference of the averages:
s <- split(dat$Bodyweight, dat$Diet)
mean(s[["chow"]])
mean(s[["hf"]]) - mean(s[["chow"]])
Finally, we use our shortcut, lm, to run the linear model:
fit <- lm(Bodyweight ~ Diet, data=dat)
summary(fit)
(coefs <- coef(fit))
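As a sanity check, the coefficients returned by lm match the matrix-algebra solution above up to floating-point error:
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
all.equal(as.vector(beta_hat), as.vector(coefs))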
Examining the coefficients
The following plot visualizes the meaning of the coefficients with colored arrows (the code that produces it is shown below):
stripchart(dat$Bodyweight ~ dat$Diet, vertical=TRUE, method="jitter",
main="Bodyweight over Diet", ylim=c(0,40), xlim=c(0,3))
a <- -0.25 # horizontal offset so the arrows sit beside the points
lgth <- .1 # arrowhead length
library(RColorBrewer)
cols <- brewer.pal(3,"Dark2")
abline(h=0)
# arrow from 0 up to the intercept: the chow group average
arrows(1+a,0,1+a,coefs[1],lwd=3,col=cols[1],length=lgth)
abline(h=coefs[1],col=cols[1])
# arrow from the intercept up by the Diethf coefficient: the difference between groups
arrows(2+a,coefs[1],2+a,coefs[1]+coefs[2],lwd=3,col=cols[2],length=lgth)
abline(h=coefs[1]+coefs[2],col=cols[2])
legend("right",names(coefs),fill=cols,cex=.75,bg="white")
To make a connection with material presented earlier, this simple linear model gives the same result (the t-statistic and p-value) for the difference as a specific kind of t-test: the two-group t-test that assumes the population standard deviation is the same for both groups. This assumption was encoded into our linear model when we assumed that the errors $\boldsymbol{\varepsilon}$ were all identically distributed.
Although in this case the linear model is equivalent to a t-test, we will soon explore more complicated designs, where the linear model is a useful extension. Below we demonstrate that one does in fact get the exact same results:
Our lm estimates were:
summary(fit)$coefficients
And the t-statistic is the same:
ttest <- t.test(s[["hf"]], s[["chow"]], var.equal=TRUE)
summary(fit)$coefficients[2,3]
ttest$statistic
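And so is the p-value (the fourth column of the coefficient table holds the p-value for each coefficient):
summary(fit)$coefficients[2,4]
ttest$p.value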