```r
library(GLMsData); data(lungcap)
lungcap$Smoke <- factor(lungcap$Smoke,
                        levels = c(0, 1),
                        labels = c("Non-smoker", "Smoker"))
LC.lm <- lm(FEV ~ Ht + Gender + Smoke, data = lungcap)
scatter.smooth(rstandard(LC.lm) ~ lungcap$Ht, col = "grey", las = 1,
               ylab = "Standardized residuals", xlab = "Height (inches)")
```
Outliers and Influential Observations
Stat 203 Lecture 11
To this point, we’ve been discussing tools for assessing overall model assumptions (what were they again?). We’re next going to explore ways of detecting problems with individual observations.
Outliers
Outliers are observations inconsistent with the rest of the data set. We can locate them by identifying the corresponding residual as unusually large (positive or negative), using tools like Q-Q plots or the other diagnostics we have discussed for assessing model assumptions.
However, a word of caution. The standardized residuals used in, say, a Q-Q plot are computed using \(s^2\), which is itself computed from the entire data set. An observation with a large raw residual therefore contributes to the calculation of \(s^2\) and may inflate it, making the outlier harder to detect. This suggests that we should instead compute \(s^2\) with Observation \(i\) omitted, and then use that estimate to standardize the residual for Observation \(i\). These residuals are called Studentized residuals[^1] and are denoted \(r_i''\).
The process of calculating Studentized residuals by hand is tedious, as a new estimate of the variance \(s^2\) must be computed for each omitted observation (you can read about the details on p. 109). This tedium can be circumvented via certain numerical identities, but we will just use the `rstudent()` function in R.
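To see what `rstudent()` is doing, we can reproduce one Studentized residual by brute force: refit the model without Observation \(i\), take that fit's variance estimate \(s_{(i)}^2\), and standardize the residual of Observation \(i\) with it. A minimal sketch on a small synthetic data set (not `lungcap`), so it is self-contained:

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
fit <- lm(y ~ x)

i   <- 7
h.i <- hatvalues(fit)[i]                 # leverage of observation i
s.i <- summary(lm(y[-i] ~ x[-i]))$sigma  # s_(i): estimate with obs i omitted
r.manual <- resid(fit)[i] / (s.i * sqrt(1 - h.i))
c(manual = unname(r.manual), rstudent = unname(rstudent(fit)[i]))
```

The two values agree; `rstudent()` simply avoids the explicit refit.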
Example 1 For the `lungcap` data, the residual plot for `Ht` (produced by the `scatter.smooth()` call above) shows no outliers, but it does show some large residuals.
We can calculate the Studentized residuals (here done for Observation 8) the easy way:

```r
rstudent(LC.lm)[8]
```
```
       8 
1.243987 
```

```r
scatter.smooth(rstudent(LC.lm) ~ lungcap$Ht, col = "grey", las = 1,
               ylab = "Studentized residuals", xlab = "Height (inches)")
```
Influential Observations
An influential observation is one that substantially changes the fitted model when omitted from the data set. Influential observations necessarily have large residuals, but they need not be outliers; similarly, outliers need not be influential.
The typical way to measure influence for Observation \(i\) is to use Cook’s distance:
\[ D_i = \frac{(r_i')^2}{p'} \left(\frac{h_i}{1-h_i}\right). \]
Larger values of \(D_i\) correspond to larger influence.
We can calculate Cook’s distance in R via the function `cooks.distance()`.
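As a check on the formula above, Cook’s distance can be assembled directly from the standardized residuals, the leverages, and \(p'\), and compared with `cooks.distance()`. A minimal sketch on synthetic data (not `lungcap`):

```r
set.seed(2)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

r.prime <- rstandard(fit)       # standardized residuals r_i'
h       <- hatvalues(fit)       # leverages h_i
p.prime <- length(coef(fit))    # p' = number of regression coefficients
D.manual <- (r.prime^2 / p.prime) * (h / (1 - h))
all.equal(D.manual, cooks.distance(fit))
```

The agreement is exact, since the formula is an algebraic identity rather than an approximation.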
A reasonable question is: what values of Cook’s distance should be considered high?

- Cook’s distance \(D\) is approximately \(F\)-distributed with \((p', n-p')\) df.
- A conservative approach is then to use the 50th percentile of this \(F\)-distribution as a guideline (this is what R uses).
- For most \(F\)-distributions, this percentile is near 1, so a useful rule of thumb is to flag observations with \(D > 1\) as potentially influential.
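For concreteness, we can inspect the 50th percentile directly with `qf()`. With \(p' = 4\) and \(n = 654\) (the sizes in the `lungcap` fit above), it sits somewhat below 1 for typical model sizes:

```r
qf(0.5, df1 = 4, df2 = 650)   # p' = 4, n - p' = 650, as in the lungcap fit
qf(0.5, df1 = 2, df2 = 30)    # a smaller model and sample
```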
Other measures of influence
DFFITS
A measure of influence similar to Cook’s distance is DFFITS (“difference in fits”). DFFITS measures how much the fitted value of Observation \(i\) changes between the model fitted with all the data and the model fitted when Observation \(i\) is omitted:
\[ \text{DFFITS}_i = \frac{\hat{\mu}_i - \hat{\mu}_{i(i)}}{s_{(i)}} = r_i'' \sqrt{\frac{h_i}{1-h_i}}. \]
We can calculate DFFITS in R using `dffits()`.
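The identity above means DFFITS needs no refitting: it can be built from `rstudent()` and `hatvalues()` and checked against `dffits()`. A minimal sketch on synthetic data:

```r
set.seed(3)
x <- runif(25)
y <- x + rnorm(25)
fit <- lm(y ~ x)

r2 <- rstudent(fit)   # Studentized residuals r''
h  <- hatvalues(fit)  # leverages
dffits.manual <- r2 * sqrt(h / (1 - h))
all.equal(dffits.manual, dffits(fit))
```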
Similarly, we can calculate DFBETAS, a coefficient-specific version of DFFITS; for coefficient \(j\) and Observation \(i\),
\[ \text{DFBETAS}_{ij} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\text{se}(\hat{\beta}_{j(i)})}. \]
We can calculate DFBETAS in R using `dfbetas()`.
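We can check DFBETAS against a brute-force refit. One computational detail: R’s `dfbetas()` scales each coefficient change by \(s_{(i)}\) times the full-data \(\sqrt{[(X^TX)^{-1}]_{jj}}\) as a shortcut for \(\text{se}(\hat{\beta}_{j(i)})\), so we follow that scaling here. A sketch on synthetic data:

```r
set.seed(4)
x <- runif(25)
y <- 1 + x + rnorm(25)
fit <- lm(y ~ x)

i      <- 3
fit.i  <- lm(y ~ x, subset = -i)                   # refit without observation i
s.i    <- summary(fit.i)$sigma                     # s_(i)
xtxinv <- diag(vcov(fit)) / summary(fit)$sigma^2   # diagonal of (X'X)^{-1}
dfbetas.manual <- (coef(fit) - coef(fit.i)) / (s.i * sqrt(xtxinv))
rbind(manual = dfbetas.manual, dfbetas = dfbetas(fit)[i, ])
```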
Finally, the covariance ratio measures the increase in uncertainty about the regression coefficients when Observation \(i\) is omitted:
\[ \text{CR}_i = \frac{1}{1-h_i} \left( \frac{n-p'}{n-p'-1 + (r_i'')^2}\right)^{p'}, \]
where \(r_i''\) is the Studentized residual. We calculate the covariance ratio in R with the function `covratio()`.
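The covariance ratio can likewise be computed from the leverages and Studentized residuals and compared with `covratio()`, which is a useful way to confirm the formula. A minimal sketch on synthetic data:

```r
set.seed(5)
x <- runif(25)
y <- x + rnorm(25)
fit <- lm(y ~ x)

n       <- length(y)
p.prime <- length(coef(fit))   # p'
h       <- hatvalues(fit)
r2      <- rstudent(fit)
cr.manual <- (1 / (1 - h)) * ((n - p.prime) / (n - p.prime - 1 + r2^2))^p.prime
all.equal(cr.manual, covratio(fit))
```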
Alternatively, we can calculate a table of all these measures in R using `influence.measures()`. The following list gives the criteria by which each measure declares an observation influential; R will helpfully flag an influential measure with a `*`.
- DFBETAS: Observation \(i\) is declared influential when \(|\text{DFBETAS}_{ij}| > 1\).
- DFFITS: Observation \(i\) is declared influential when \(|\text{DFFITS}_i| > 3\sqrt{p'/(n-p')}\).
- Covariance ratio CR: Observation \(i\) is declared influential when \(|1 - \text{CR}_i| > 3p'/(n-p')\).
- Cook’s distance \(D\): Observation \(i\) is declared influential when \(D\) exceeds the 50th percentile of the \(F\) distribution with \((p', n-p')\) df.
- Leverages \(h\): Observations are declared high leverage if \(h > 3p'/n\).
We close with a note that different criteria may declare different observations influential. The covariance ratio has a tendency to declare more observations as influential than the other criteria.
Some Examples in R
Example 2 We can identify the highest/lowest values of Cook’s distance:
```r
cd.max <- which.max(cooks.distance(LC.lm))  # Largest D
cd.min <- which.min(cooks.distance(LC.lm))  # Smallest D
c(Min.Cook = cd.min, Max.Cook = cd.max)
```
```
 Min.Cook.69 Max.Cook.613 
          69          613 
```
```r
out <- cbind(DFFITS = dffits(LC.lm),
             Cooks.distance = cooks.distance(LC.lm),
             Cov.ratio = covratio(LC.lm))
round(out[c(cd.min, cd.max), ], 5)  # Show the values for these obs only
```
```
      DFFITS Cooks.distance Cov.ratio
69   0.00006        0.00000   1.01190
613 -0.39647        0.03881   0.96737
```
We see that Observation 613 is more influential than Observation 69 according to DFFITS and Cook’s distance (but not the covariance ratio). We can use DFBETAS to examine the influence of these observations on the regression parameters.
```r
dfbetas(LC.lm)[cd.min, ]  # Show DFBETAS for cd.min
```
```
  (Intercept)            Ht       GenderM   SmokeSmoker 
 4.590976e-05 -3.974922e-05 -2.646158e-05 -1.041249e-06 
```

```r
dfbetas(LC.lm)[cd.max, ]  # Show DFBETAS for cd.max
```
```
(Intercept)          Ht     GenderM SmokeSmoker 
 0.05430730 -0.06394615  0.10630441 -0.31682958 
```
We collect the influence measures in a table.
```r
LC.im <- influence.measures(LC.lm)
names(LC.im)
```
```
[1] "infmat" "is.inf" "call"
```
```r
head(round(LC.im$infmat, 3))  # Show for first few observations only
```
```
  dfb.1_ dfb.Ht dfb.GndM dfb.SmkS  dffit cov.r cook.d   hat
1  0.117 -0.109   -0.024    0.015  0.127 1.012  0.004 0.013
2 -0.005  0.005    0.001   -0.001 -0.006 1.017  0.000 0.010
3  0.051 -0.047   -0.014    0.005  0.058 1.015  0.001 0.010
4  0.113 -0.104   -0.031    0.012  0.127 1.007  0.004 0.010
5  0.116 -0.106   -0.036    0.010  0.133 1.004  0.004 0.009
6  0.084 -0.077   -0.026    0.007  0.097 1.009  0.002 0.009
```
To determine how many observations are considered influential by each criterion, we sum the columns of the `is.inf` component (R treats `FALSE` as 0 and `TRUE` as 1):
```r
head(LC.im$is.inf)
```
```
  dfb.1_ dfb.Ht dfb.GndM dfb.SmkS dffit cov.r cook.d   hat
1  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
2  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
3  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
4  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
5  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
6  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
```

```r
colSums(LC.im$is.inf)
```
```
  dfb.1_   dfb.Ht dfb.GndM dfb.SmkS    dffit    cov.r   cook.d      hat 
       0        0        0        0       18       56        0        7 
```
You can read about other ways of exploring influence on pp. 114-115 of the text.
Footnotes
[^1]: Though Studentized residuals are named in honor of William Sealy Gosset (aka Student), they were not discovered by him.