Outliers and Influential Observations

Stat 203 Lecture 11


Dr. Janssen

Outliers and Influential Observations

To this point, we’ve been discussing tools for assessing overall model assumptions (what were they again?). We’re next going to explore ways of detecting problems with individual observations.


How could the problems of model assumption violations and problems with individual observations be related?


Outliers are observations inconsistent with the rest of the data set. We can locate them by looking for residuals that are unusually large (positive or negative), using Q-Q plots or the other tools we’ve discussed for assessing model assumptions.

A word of caution, however. The standardized residuals used in, say, a Q-Q plot are computed using \(s^2\), which in turn is computed from the entire data set. An observation with a large raw residual therefore contributes to the calculation of \(s^2\) and may inflate its value, which shrinks that observation's standardized residual and makes the outlier harder to detect. This suggests that we should calculate \(s^2\) with Observation \(i\) omitted, and then use that estimate to standardize the residual for Observation \(i\). The resulting residuals are called Studentized residuals¹ and are denoted by \(r_i''\).
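The inflation effect described above can be seen in a small simulation. This is only a sketch: the data are simulated, and the names `dat`, `fit.all`, `fit.omit` and the position of the planted outlier are hypothetical choices, not part of the lungcap example.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(203)
dat <- data.frame(x = 1:30)
dat$y <- 2 + 0.5 * dat$x + rnorm(30, sd = 1)
dat$y[15] <- dat$y[15] + 8                 # plant one outlier at Observation 15

fit.all  <- lm(y ~ x, data = dat)          # s computed from all 30 observations
fit.omit <- lm(y ~ x, data = dat[-15, ])   # s computed with Observation 15 omitted

s.all  <- summary(fit.all)$sigma
s.omit <- summary(fit.omit)$sigma
c(s.all = s.all, s.omit = s.omit)          # compare the two estimates of sigma

# Raw residual of Observation 15 (from the full fit), scaled by each estimate.
# The remaining factor in the standardization is the same in both cases,
# so it is left out of this comparison.
e15 <- unname(resid(fit.all)[15])
c(scaled.by.s.all = e15 / s.all, scaled.by.s.omit = e15 / s.omit)
```

Because the outlier contributes to `s.all`, that estimate is the larger of the two, so the same raw residual looks less extreme when scaled by it than when scaled by `s.omit`.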

The process of calculating Studentized residuals by hand is tedious, as a new estimate for the variance \(s^2\) needs to be calculated for each omission of an observation (you can read about the details on p. 109). This tedium can be circumvented via certain numerical identities, but we will just use the rstudent() function in R.
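To see what rstudent() saves us from, here is a minimal sketch that computes the Studentized residuals the tedious way, refitting the model once per omitted observation, and checks the result against rstudent(). It assumes the usual standardization by \(s\sqrt{1 - h_i}\), with the leverages \(h_i\) taken from hatvalues(), and it uses R's built-in `cars` data rather than lungcap so that it runs without extra packages; the names `s.loo` and `r.manual` are made up for this illustration.

```r
# Illustration on the built-in 'cars' data (not the lecture's lungcap data)
fit <- lm(dist ~ speed, data = cars)
n <- nrow(cars)
h <- hatvalues(fit)     # leverages h_i
e <- resid(fit)         # raw residuals from the full fit

# Brute force: refit without Observation i to get s_(i),
# the estimate of sigma that omits Observation i
s.loo <- sapply(seq_len(n), function(i) {
  summary(lm(dist ~ speed, data = cars[-i, ]))$sigma
})

# Studentized residuals: full-fit residual scaled by the leave-one-out estimate
r.manual <- e / (s.loo * sqrt(1 - h))

# Built-in shortcut: no refitting required
r.builtin <- rstudent(fit)

# The two agree up to numerical error
all.equal(unname(r.manual), unname(r.builtin))
```

The brute-force version fits \(n\) separate models, while rstudent() obtains the same values from the single full fit.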

Example 1. For the lungcap data, the plot of standardized residuals against Ht shows no outliers, but some large residuals:

library(GLMsData); data(lungcap)
# Treat Smoke (coded 0/1) as a factor rather than a number
lungcap$Smoke <- factor(lungcap$Smoke,
                         levels=c(0, 1))
LC.lm <- lm( FEV ~ Ht + Gender + Smoke, data=lungcap)
# Standardized residuals against height, with a smoothing curve
scatter.smooth( rstandard( LC.lm ) ~ lungcap$Ht, col="grey", las=1,
                ylab="Standardized residuals", xlab="Height (inches)")