Testing Linearity with Residuals

Stat 203 Lecture 10

Author

Dr. Janssen

Residual Plots

Linearity: plot residuals against \(x_j\)

When considering the linearity assumption in an exploratory data analysis, we typically plot the response variable against each explanatory variable. But such a plot is not sophisticated enough to detect competing effects of multiple explanatory variables.

On the other hand, a plot of residuals against a covariate \(x_j\) can more easily detect deviations from linearity, as the linear effects of all explanatory variables have been removed. If the model fits well, the residuals should show no pattern. If a systematic trend exists in the residuals, we may need to transform the covariate or include extra terms in the model.
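As a rough illustration of this idea, the sketch below uses simulated data rather than a real dataset. The names x1, x2, y, fit.lin and fit.quad, and all coefficient values, are made up for the illustration. The response depends on x1 through a quadratic term, so a model that is linear in x1 should leave a curved trend in the residuals, and adding the extra term should remove it.

```r
# Simulated data (hypothetical values): y is quadratic in x1, linear in x2
set.seed(203)
n  <- 300
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 + 0.3 * x1^2 + 1.5 * x2 + rnorm(n)

fit.lin  <- lm(y ~ x1 + x2)             # ignores the curvature in x1
fit.quad <- lm(y ~ x1 + I(x1^2) + x2)   # includes the extra term

# Residuals against x1 for each model, side by side
par(mfrow = c(1, 2))
scatter.smooth(x1, rstandard(fit.lin), col = "grey", las = 1,
     xlab = "x1", ylab = "Standardized residuals", main = "Linear in x1")
scatter.smooth(x1, rstandard(fit.quad), col = "grey", las = 1,
     xlab = "x1", ylab = "Standardized residuals", main = "Quadratic in x1")
par(mfrow = c(1, 1))

# Numeric check: curvature left over in the residuals of each model
curv.lin  <- coef(lm(resid(fit.lin)  ~ x1 + I(x1^2)))[[3]]
curv.quad <- coef(lm(resid(fit.quad) ~ x1 + I(x1^2)))[[3]]
```

In the left panel the smoothing curve should bend, while in the right panel it should stay close to zero. The quantities curv.lin and curv.quad summarize the same contrast numerically.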

Example 1 We’ll use scatter.smooth() in place of plot() to add a smoothing curve and make it easier to see trends.

library(GLMsData)   # provides the lungcap data
data(lungcap)
# Recode the 0/1 smoking indicator as a labelled factor
lungcap$Smoke <- factor(lungcap$Smoke,
                         levels=c(0, 1),
                         labels=c("Non-smoker","Smoker"))
LC.lm <- lm( FEV ~ Ht + Gender + Smoke, data=lungcap)
# Standardized residuals against height, with a smoothing curve
scatter.smooth( rstandard( LC.lm ) ~ lungcap$Ht, col="grey", las=1,
     ylab="Standardized residuals", xlab="Height (inches)")

This plot is slightly nonlinear, with increasing variance, suggesting a poor fit. (And of course, we don’t check linearity for gender or smoking status, as they are factors.)

Exploration

Choose one of your models (perhaps of the Term dataset) and evaluate the linearity assumption for the covariate(s) you included using the residual plots.

Partial Residual Plots

Another tool for exploring the linearity assumption for a given covariate \(x_j\) is the partial residual, calculated from the residuals \(r\) and the estimated coefficient \(\hat{\beta}_j\):

\[ u_j = r + \hat{\beta}_j x_j. \]

The partial residual plot is a plot of \(u_j\) against \(x_j\). It shows much the same information as the residual plot, but lets the statistician judge how important any nonlinearity is relative to the magnitude of the linear trend.

We can also think of the partial residual plot as providing, in the context of multiple regression, the same simplicity of interpretation as the plot of \(y\) against \(x\) in simple linear regression: we have adjusted the response for the other explanatory variables and are considering \(x_j\) alone. The slope of the least-squares line fitted to the partial residual plot equals the coefficient for that explanatory variable in the full model. However, the variance of the points around the line may appear smaller than it actually is, because the residuals being plotted are from the full regression model with \(n-p'\) df rather than a simple linear regression with \(n-2\) df.
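One way to see why the slope matches is sketched here, assuming the model includes an intercept. The index \(i\) is added for this illustration: \(r_i\), \(x_{ij}\) and \(u_{ij}\) denote the \(i\)th components of \(r\), \(x_j\) and \(u_j\), and \(\bar{x}_j\) is the mean of \(x_j\). Least-squares residuals satisfy \(\sum_i r_i = 0\) and \(\sum_i r_i x_{ij} = 0\), so the slope of the least-squares line of \(u_j\) on \(x_j\) is

\[ \frac{\sum_i (x_{ij}-\bar{x}_j)\, u_{ij}}{\sum_i (x_{ij}-\bar{x}_j)^2} = \frac{\sum_i (x_{ij}-\bar{x}_j)\, r_i}{\sum_i (x_{ij}-\bar{x}_j)^2} + \hat{\beta}_j \frac{\sum_i (x_{ij}-\bar{x}_j)\, x_{ij}}{\sum_i (x_{ij}-\bar{x}_j)^2} = 0 + \hat{\beta}_j, \]

using \(\sum_i (x_{ij}-\bar{x}_j)\, x_{ij} = \sum_i (x_{ij}-\bar{x}_j)^2\) in the second term.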

Example 2 We can calculate the partial residuals for each variable in our model for the lungcap dataset:

partial.resid <- resid( LC.lm, type="partial")
head(partial.resid)
          Ht     Gender       Smoke
1 -1.4958086  0.4026274  0.46481270
2 -1.7288086 -0.0897584 -0.02757306
3 -1.4658086  0.1732416  0.23542694
4 -1.1788086  0.4602416  0.52242694
5 -0.9908086  0.5185487  0.58073406
6 -1.1498086  0.3595487  0.42173406
termplot( LC.lm, partial.resid=TRUE, terms="Ht", las=1)
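A note on what R returns: resid() with type="partial" adds the centred term \(\hat{\beta}_j (x_j - \bar{x}_j)\) to the residuals, so its values differ from \(r + \hat{\beta}_j x_j\) by a constant. That shift moves the plot up or down but does not change the slope. The sketch below checks both points on R's built-in mtcars data, used here only as a stand-in so that the code runs without extra packages. The names fit, pr, b.wt, manual and slope are made up for the illustration.

```r
# Stand-in model on a built-in dataset (not the lecture's lungcap model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Partial residuals as returned by R: one column per model term
pr <- resid(fit, type = "partial")

# By hand: ordinary residuals plus the centred linear term for wt
b.wt   <- coef(fit)[["wt"]]
manual <- resid(fit) + b.wt * (mtcars$wt - mean(mtcars$wt))
all.equal(unname(pr[, "wt"]), unname(manual))

# Slope of the least-squares line through the partial residuals for wt
slope <- coef(lm(pr[, "wt"] ~ mtcars$wt))[[2]]
all.equal(slope, b.wt)
```

Both comparisons should agree up to numerical tolerance.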

Exploration

Calculate some partial residuals and the accompanying plot to evaluate the linearity assumption for one or more variables in a model you’ve developed.

Residuals against \(\hat{\mu}\)

Plotting residuals against the fitted values \(\hat{\mu}\) is used to check for constant variance. An increasing or decreasing trend in the variability of the residuals about the zero line suggests the need to transform or change the scale of the response variable to achieve constant variance. Standardized residuals are preferred, as they have approximately constant variance if the model fits well.

Example 3 We can use the code below to plot the standardized residuals against the fitted values.

scatter.smooth( rstandard( LC.lm ) ~ fitted( LC.lm ), col="grey",
     las=1, ylab="Standardized residuals", xlab="Fitted values")
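To illustrate the remedy of transforming the response, here is a minimal sketch on simulated data. The names x, y, fit.raw and fit.log and all numeric values are made up for the illustration. The errors are multiplicative, so the spread of y grows with its mean; a log transformation is one possible choice of scale, not the only one. On the raw scale the fit is also curved, so both problems appear in the first plot.

```r
# Simulated data (hypothetical values) with multiplicative errors
set.seed(10)
n <- 1000
x <- runif(n, 0, 10)
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.3))

fit.raw <- lm(y ~ x)        # response on its original scale
fit.log <- lm(log(y) ~ x)   # response on the log scale

# Standardized residuals against fitted values for each model
par(mfrow = c(1, 2))
scatter.smooth(fitted(fit.raw), rstandard(fit.raw), col = "grey", las = 1,
     xlab = "Fitted values", ylab = "Standardized residuals", main = "y")
scatter.smooth(fitted(fit.log), rstandard(fit.log), col = "grey", las = 1,
     xlab = "Fitted values", ylab = "Standardized residuals", main = "log(y)")
par(mfrow = c(1, 1))

# Rough numeric summary: does the size of the residuals grow with the fit?
cor.raw <- cor(abs(rstandard(fit.raw)), fitted(fit.raw))
cor.log <- cor(abs(rstandard(fit.log)), fitted(fit.log))
```

The correlation between the absolute standardized residuals and the fitted values is only a crude summary of the fan shape. The plots remain the main diagnostic.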

Q-Q plots for checking normality

The assumption of normality can be checked using a normal quantile-quantile, or Q-Q plot, of the residuals. In general, we can create Q-Q plots with any distribution, but in the case of multiple linear regression we use the normal distribution.

The idea is to compare the quantiles of the data against the quantiles of the desired distribution. So, e.g., the value below which 20% of the data lie is plotted against the value below which 20% of a standard normal distribution lies. If the residuals have a standard normal distribution, the points will lie approximately on a straight line.
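The construction can be sketched by hand in a few lines. Here z is a simulated stand-in for a vector of standardized residuals, and the names z, theo, samp and qq are made up for the illustration. The sorted data are plotted against standard normal quantiles at the probabilities given by ppoints(), which is also what qqnorm() uses internally.

```r
# Simulated stand-in for a vector of standardized residuals
set.seed(1)
z <- rnorm(50)
n <- length(z)

theo <- qnorm(ppoints(n))   # theoretical standard normal quantiles
samp <- sort(z)             # sample quantiles: the ordered data

# Hand-made Q-Q plot, with the line y = x as a reference
plot(theo, samp, las = 1, pch = 19,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1, lty = 2)

# The same coordinates as computed by qqnorm(), without drawing the plot
qq <- qqnorm(z, plot.it = FALSE)
all.equal(sort(qq$x), theo)
all.equal(sort(qq$y), samp)
```

For the 20% example in the text, the theoretical value is qnorm(0.2), which is about -0.84.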

The figures on p. 107 of the book give a sense of what different Q-Q plots can look like, and how to assess the behavior you’re seeing.

Since standardized residuals are more normally distributed than raw residuals, Q-Q plots are more appropriate, and outliers are easier to identify, when standardized residuals are used.

Example 4 Here is a Q-Q plot of the standardized residuals of our lungcap model.

qqnorm( rstandard( LC.lm ), las=1, pch=19)
qqline( rstandard( LC.lm ) ) # reference line

Question

Given the discussion on p. 107, how would we assess the assumption of normality for this model?