Outliers and Influential Observations

Stat 203 Lecture 11

Author

Dr. Janssen

Outliers and Influential Observations

To this point, we’ve been discussing tools for assessing overall model assumptions (what were they again?). We’re next going to explore ways of detecting problems with individual observations.

Question

How could the problems of model assumption violations and problems with individual observations be related?

Outliers

Outliers are observations inconsistent with the rest of the data set. We can locate them by identifying the corresponding residuals as unusually large (positive or negative), using tools such as Q-Q plots or the other diagnostics we’ve discussed for assessing model assumptions.

However, a word of caution. The standardized residuals used in, say, a Q-Q plot are computed using \(s^2\), which in turn is computed from the entire data set. An observation with a large raw residual is thus part of the calculation of \(s^2\) and may inflate it, making the outlier harder to detect. This suggests that we should first calculate \(s^2\) with Observation \(i\) omitted, and then use that estimate to standardize the residual for Observation \(i\). These residuals are called Studentized residuals1 and are denoted \(r_i''\).

The process of calculating Studentized residuals by hand is tedious, as a new estimate for the variance \(s^2\) needs to be calculated for each omission of an observation (you can read about the details on p. 109). This tedium can be circumvented via certain numerical identities, but we will just use the rstudent() function in R.
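As a sketch of what rstudent() is doing behind the scenes, we can rebuild one Studentized residual directly from its leave-one-out definition. (This uses the built-in mtcars data rather than lungcap so it runs on its own; the choice of data set and observation is arbitrary.)

```r
# Rebuild the Studentized residual for one observation by hand:
# omit observation i, re-estimate s from the remaining data,
# then standardize observation i's raw residual.
fit <- lm(mpg ~ wt, data = mtcars)
i <- 5
e_i <- resid(fit)[i]             # raw residual for observation i
h_i <- hatvalues(fit)[i]         # leverage of observation i
s_omit <- summary(lm(mpg ~ wt, data = mtcars[-i, ]))$sigma  # s_(i)
r_manual <- e_i / (s_omit * sqrt(1 - h_i))
all.equal(unname(r_manual), unname(rstudent(fit)[i]))  # TRUE
```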

Example 1 For the lungcap data, the residual plot for Ht shows no outliers, but some large residuals:

library(GLMsData); data(lungcap);
lungcap$Smoke <- factor(lungcap$Smoke,
                         levels=c(0, 1),
                         labels=c("Non-smoker","Smoker"))
LC.lm <- lm( FEV ~ Ht + Gender + Smoke, data=lungcap)
scatter.smooth(rstandard(LC.lm) ~ lungcap$Ht, col="grey", las=1,
               ylab="Standardized residuals", xlab="Height (inches)")

We can calculate the Studentized residuals (here done for observation 8) the easy way:

rstudent(LC.lm)[8]
       8 
1.243987 
scatter.smooth(rstudent(LC.lm) ~ lungcap$Ht, col="grey", las=1, ylab="Studentized residuals", xlab="Height (inches)")

Exploration

Calculate the Studentized residuals for one of your models. What do you notice? What do you wonder?

Influential Observations

An influential observation is one that substantially changes the fitted model when omitted from the data set. Influential observations necessarily have moderately large residuals, but need not be outliers; similarly, outliers need not be influential.

The typical way to measure influence for Observation \(i\) is to use Cook’s distance:

\[ D_i = \frac{(r_i')^2}{p'} \left(\frac{h_i}{1-h_i}\right). \] {#eqn-cook}

Larger values of \(D_i\) correspond to larger influence.

Question

Describe the features of a highly influential observation.

We can calculate Cook’s distance in R via the function cooks.distance().

A reasonable question is: what values of Cook’s distance should be considered high?

  • Cook’s distance \(D\) is approximately \(F\)-distributed with \((p',n-p')\) df.
  • A conservative approach is then to use the 50th percentile of the \(F\)-distribution as a guideline (this is used by R).
  • For most \(F\)-distributions, this is near 1, so a useful rule of thumb is to flag observations with \(D > 1\) as potentially influential.
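Putting these pieces together, the following sketch (again on the built-in mtcars data, an arbitrary choice) rebuilds Cook’s distance from the standardized residuals and leverages, and computes the 50th-percentile \(F\) cutoff:

```r
# Cook's distance from its ingredients: standardized residuals and leverages.
fit <- lm(mpg ~ wt + hp, data = mtcars)
p_prime <- length(coef(fit))   # p': number of regression parameters
h <- hatvalues(fit)            # leverages
D_manual <- (rstandard(fit)^2 / p_prime) * (h / (1 - h))
all.equal(D_manual, cooks.distance(fit))        # TRUE
# The conservative cutoff: 50th percentile of F(p', n - p'):
qf(0.50, p_prime, nrow(mtcars) - p_prime)
# Rule of thumb: flag observations with D > 1
which(D_manual > 1)
```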

Other measures of influence

DFFITS

A measure of influence similar to Cook’s distance is DFFITS (“difference in fits”). DFFITS measures how much the fitted value of Observation \(i\) changes between the model fitted with all the data and the model fitted when Observation \(i\) is omitted:

\[ \text{DFFITS}_i = \frac{\hat{\mu}_i - \hat{\mu}_{i(i)}}{s_{(i)}\sqrt{h_i}} = r_i'' \sqrt{\frac{h_i}{1-h_i}}. \]

We can calculate DFFITS in R using dffits().
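A quick sketch confirming the identity above, i.e. that dffits() equals the Studentized residual times \(\sqrt{h/(1-h)}\) (again on the built-in mtcars data, an arbitrary choice):

```r
# Check the DFFITS identity: dffits() equals r'' * sqrt(h / (1 - h)).
fit <- lm(mpg ~ wt, data = mtcars)
h <- hatvalues(fit)
all.equal(dffits(fit), rstudent(fit) * sqrt(h / (1 - h)))  # TRUE
```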

Similarly, we can calculate DFBETAS, a coefficient-specific version of DFFITS:

\[ \text{DFBETAS}_{ij} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\text{se}(\hat{\beta}_{j(i)})}. \]

We can calculate DFBETAS in R using dfbetas().
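A minimal usage sketch for dfbetas(), again on the built-in mtcars data (an arbitrary choice); the \(|\text{DFBETAS}| > 1\) criterion used below is discussed later in this section:

```r
# dfbetas(): one row per observation, one column per coefficient.
fit <- lm(mpg ~ wt + hp, data = mtcars)
dfb <- dfbetas(fit)
head(round(dfb, 3))
# Flag observations whose omission moves any coefficient by more
# than one standard error (the |DFBETAS| > 1 criterion):
which(apply(abs(dfb), 1, max) > 1)
```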

Finally, the covariance ratio measures the increase in uncertainty about the regression coefficients when Observation \(i\) is omitted:

\[ \text{CR}_i = \frac{1}{1-h_i} \left( \frac{n-p'}{n-p'-1 + (r_i'')^2}\right)^{p'}, \]

where \(r_i''\) is the Studentized residual. We calculate the covariance ratio in R with the function covratio().
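As a sketch, covratio() can be reproduced from the leverages and Studentized residuals alone (built-in mtcars data again, an arbitrary choice):

```r
# Reproduce covratio() from leverages h and Studentized residuals r''.
fit <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars); p_prime <- length(coef(fit))
h <- hatvalues(fit); r2 <- rstudent(fit)^2
cr_manual <- (1 / (1 - h)) * ((n - p_prime) / (n - p_prime - 1 + r2))^p_prime
all.equal(cr_manual, covratio(fit))  # TRUE
```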

Alternatively, we can calculate a table of all these measures in R using influence.measures(). The following list provides criteria by which each measure may be considered influential. R will helpfully flag an influential measure with a *.

  • DFBETAS: Observation \(i\) is declared influential when \(|\text{DFBETAS}_i| > 1\).
  • DFFITS: Observation \(i\) is declared influential when \(|\text{DFFITS}_i| > 3\sqrt{p'/(n-p')}\).
  • Covariance ratio CR: Observation \(i\) is declared influential when \(|1 - \text{CR}_i| > 3p'/(n-p')\).
  • Cook’s distance \(D\): Observation \(i\) is declared influential when \(D\) exceeds the 50th percentile of the \(F\) distribution with \((p',n-p')\) df
  • Leverages \(h\): Observations are declared high leverage if \(h > 3p'/n\).

We close with a note that different criteria may declare different observations influential. The covariance ratio has a tendency to declare more observations as influential than the other criteria.

Some Examples in R

Example 2 We can identify the highest/lowest values of Cook’s distance:

cd.max <- which.max( cooks.distance(LC.lm)) # Largest D
cd.min <- which.min( cooks.distance(LC.lm))   # Smallest D
c(Min.Cook = cd.min, Max.Cook = cd.max)
 Min.Cook.69 Max.Cook.613 
          69          613 
out <- cbind( DFFITS=dffits(LC.lm),
                Cooks.distance=cooks.distance(LC.lm),
                Cov.ratio=covratio(LC.lm))

round( out[c(cd.min, cd.max),], 5)  # Show the values for these obs only
      DFFITS Cooks.distance Cov.ratio
69   0.00006        0.00000   1.01190
613 -0.39647        0.03881   0.96737

We see that Observation 613 is more influential than Observation 69 according to DFFITS and Cook’s distance (but not the covariance ratio). We can use DFBETAS to examine the influence of these observations on the regression parameters.

dfbetas(LC.lm)[cd.min,]  # Show DFBETAS for cd.min
  (Intercept)            Ht       GenderM   SmokeSmoker 
 4.590976e-05 -3.974922e-05 -2.646158e-05 -1.041249e-06 
dfbetas(LC.lm)[cd.max,]  # Show DFBETAS for cd.max
(Intercept)          Ht     GenderM SmokeSmoker 
 0.05430730 -0.06394615  0.10630441 -0.31682958 
Question

How should we interpret this output?

We collect the influence measures in a table.

LC.im <- influence.measures( LC.lm ); names(LC.im)
[1] "infmat" "is.inf" "call"  
head( round(LC.im$infmat, 3) )  # Show for first few observations only
  dfb.1_ dfb.Ht dfb.GndM dfb.SmkS  dffit cov.r cook.d   hat
1  0.117 -0.109   -0.024    0.015  0.127 1.012  0.004 0.013
2 -0.005  0.005    0.001   -0.001 -0.006 1.017  0.000 0.010
3  0.051 -0.047   -0.014    0.005  0.058 1.015  0.001 0.010
4  0.113 -0.104   -0.031    0.012  0.127 1.007  0.004 0.010
5  0.116 -0.106   -0.036    0.010  0.133 1.004  0.004 0.009
6  0.084 -0.077   -0.026    0.007  0.097 1.009  0.002 0.009

To determine how many observations are considered influential by each criterion, we use the `is.inf` component and sum the columns (R treats `FALSE` as 0 and `TRUE` as 1):

head( LC.im$is.inf )
  dfb.1_ dfb.Ht dfb.GndM dfb.SmkS dffit cov.r cook.d   hat
1  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
2  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
3  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
4  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
5  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
6  FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
colSums( LC.im$is.inf )
  dfb.1_   dfb.Ht dfb.GndM dfb.SmkS    dffit    cov.r   cook.d      hat 
       0        0        0        0       18       56        0        7 

You can read about other ways of exploring influence on pp. 114-115 of the text.

Exploration

Explore the Term dataset and identify the top and bottom 2 observations in terms of influence.

Footnotes

  1. Though Studentized residuals are named in honor of William Sealy Gosset (aka Student), they were not discovered by him.↩︎