```
library(GLMsData); data(lungcap)
# Recode Smoke (0/1) as a labelled factor
lungcap$Smoke <- factor(lungcap$Smoke,
                        levels=c(0, 1),
                        labels=c("Non-smoker","Smoker"))
# Fit the linear model, then plot standardized residuals against height
LC.lm <- lm( FEV ~ Ht + Gender + Smoke, data=lungcap)
scatter.smooth( rstandard( LC.lm ) ~ lungcap$Ht, col="grey", las=1,
                ylab="Standardized residuals", xlab="Height (inches)")
```

# Outliers and Influential Observations

Stat 203 Lecture 11

To this point, we’ve been discussing tools for assessing overall model assumptions (what were they again?). We’re next going to explore ways of detecting problems with individual observations.

## Outliers

**Outliers** are observations inconsistent with the rest of the data set. We can locate them by looking for residuals that are unusually large (positive or negative), using Q-Q plots or the other tools we’ve discussed for assessing model assumptions.

However, a word of caution. The standardized residuals used in, say, a Q-Q plot are computed using \(s^2\), which in turn is computed from the entire data set. An observation with a large raw residual therefore contributes to the calculation of \(s^2\) and may inflate its value, *making the outlier harder to detect*. This suggests that we should first calculate \(s^2\) with Observation \(i\) omitted, and then use that value to compute the standardized residual for Observation \(i\). These residuals are called *Studentized residuals*^{1} and denoted by \(r_i''\).
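In symbols, the difference between the two kinds of residual can be sketched as follows. This assumes notation that is not defined in this section: \(\hat\mu_i\) is the fitted value, \(h_i\) is the leverage of Observation \(i\), \(s_{(i)}\) is the estimate of \(\sigma\) computed with Observation \(i\) omitted, and all prior weights are equal to one.

\[
r_i' = \frac{y_i - \hat\mu_i}{s\,\sqrt{1 - h_i}}
\qquad\text{and}\qquad
r_i'' = \frac{y_i - \hat\mu_i}{s_{(i)}\,\sqrt{1 - h_i}}
\]

The only change is in the denominator: \(s\) is replaced by \(s_{(i)}\). If Observation \(i\) is an outlier, we expect \(s_{(i)} < s\), so \(|r_i''| > |r_i'|\) and the observation stands out more clearly.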

The process of calculating Studentized residuals by hand is tedious, as a new estimate for the variance \(s^2\) needs to be calculated for each omission of an observation (you can read about the details on p. 109). This tedium can be circumvented via certain numerical identities, but we will just use the `rstudent()` function in `R`.
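To see what `rstudent()` saves us from, here is a minimal sketch that computes Studentized residuals the tedious way, refitting the model once per omitted observation, and checks the result against `rstudent()`. The built-in `cars` data and the model `dist ~ speed` are stand-ins chosen only so the code runs without extra packages; the names `fit`, `s.loo`, `r.manual`, `r.student` and `r.identity` are made up for this illustration.

```r
# Illustrative only: the built-in `cars` data stand in for the lecture data
fit <- lm(dist ~ speed, data = cars)

n <- nrow(cars)
h <- hatvalues(fit)   # leverages
e <- resid(fit)       # raw residuals

# The tedious way: refit without Observation i to get s_(i), for every i
s.loo <- sapply(seq_len(n), function(i) {
  summary(lm(dist ~ speed, data = cars[-i, ]))$sigma
})
r.manual <- e / (s.loo * sqrt(1 - h))

# The easy way: no refitting needed
r.student <- rstudent(fit)
all.equal(unname(r.manual), unname(r.student))

# One identity that avoids refitting: rescale the standardized residuals
# using the residual degrees of freedom of the full fit
r.std <- rstandard(fit)
df.res <- df.residual(fit)
r.identity <- r.std * sqrt((df.res - 1) / (df.res - r.std^2))
all.equal(unname(r.identity), unname(r.student))

# Standardized and Studentized residuals side by side
head(cbind(standardized = r.std, studentized = r.student))
```

Both comparisons should agree up to rounding error. For most observations the two kinds of residual are nearly identical; they differ noticeably only where the standardized residual is already large.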

**Example 1** For the `lungcap` data, the residual plot for `Ht` shows no outliers, but some large residuals: