library(GLMsData); data(lungcap)
# Recode Smoke (0/1) as a labelled factor
lungcap$Smoke <- factor(lungcap$Smoke,
                        levels=c(0, 1),
                        labels=c("Non-smoker","Smoker"))
# Fit the linear model, then plot the standardized residuals against height
LC.lm <- lm( FEV ~ Ht + Gender + Smoke, data=lungcap)
scatter.smooth( rstandard( LC.lm ) ~ lungcap$Ht, col="grey", las=1, ylab="Standardized residuals", xlab="Height (inches)")
Outliers and Influential Observations
Stat 203 Lecture 11
To this point, we’ve been discussing tools for assessing overall model assumptions (what were they again?). We’re next going to explore ways of detecting problems with individual observations.
Outliers
Outliers are observations inconsistent with the rest of the data set. We can locate them by identifying residuals that are unusually large (positive or negative), using Q-Q plots or the other tools we’ve discussed for assessing model assumptions.
However, a word of caution. The standardized residuals used in, say, a Q-Q plot are computed using \(s^2\), which in turn is computed from the entire data set. An observation with a large raw residual therefore contributes to the calculation of \(s^2\) and may inflate its value, which shrinks that observation’s standardized residual and makes the outlier harder to detect. This suggests that we should calculate \(s^2\) with Observation \(i\) omitted, and then use that estimate to compute the standardized residual for Observation \(i\). These residuals are called Studentized residuals and are denoted by \(r_i''\).
The process of calculating Studentized residuals by hand is tedious, as a new estimate for the variance \(s^2\) needs to be calculated for each omission of an observation (you can read about the details on p. 109). This tedium can be circumvented via certain numerical identities, but we will just use the rstudent() function in R.
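To see what rstudent() is doing, here is a minimal sketch of the brute-force calculation: refit the model with each observation left out, record the residual standard deviation \(s_{(i)}\) from that fit, and divide the raw residual by \(s_{(i)}\sqrt{1 - h_i}\), where \(h_i\) is the leverage. The sketch uses R’s built-in cars data rather than lungcap so that it runs without extra packages; the data set and the names s.omit and r.student are assumptions made for illustration, not part of the lecture.

```r
# Sketch: Studentized residuals by brute force, checked against rstudent().
# Assumption: the built-in `cars` data stands in for the lecture's data.
fit <- lm(dist ~ speed, data = cars)

e <- resid(fit)       # raw residuals
h <- hatvalues(fit)   # leverages h_i
n <- nrow(cars)

# For each i, refit without Observation i and keep that fit's estimate of s
s.omit <- sapply(seq_len(n), function(i) {
  summary(lm(dist ~ speed, data = cars[-i, ]))$sigma
})

# Standardize each raw residual using the leave-one-out estimate of s
r.student <- e / (s.omit * sqrt(1 - h))

# Should agree with R's built-in version up to numerical tolerance
all.equal(unname(r.student), unname(rstudent(fit)))
```

The loop fits the model n times, which is exactly the tedium described above; rstudent() gets the same values from a single fit.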
Example 1 For the lungcap data, the residual plot for Ht shows no outliers, but some large residuals: