Polynomial Trends, Regression Splines, and Fixing Problems

Stat 203 Lecture 14

Dr. Janssen

Cautions

  • When we want to include higher powers of the covariate \(x\), the coefficients may be highly correlated (see the sketch after this list).
  • Different combinations of correlated covariates may lead to nearly the same fitted values, creating difficulties in interpretation.
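A minimal sketch of the problem, using a simulated covariate (not from the lecture): \(x\) and \(x^2\) are highly correlated, though centering \(x\) before squaring removes most of that correlation. The lecture's own remedy, orthogonal polynomials via poly(), appears below.

x <- 1:20                            # simulated covariate
cor( x, x^2 )                        # nearly 1
xc <- x - mean(x)                    # centre the covariate first
cor( xc, xc^2 )                      # essentially zero here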

Example: heatcap data

library(GLMsData); data(heatcap)
plot( Cp ~ Temp, data=heatcap, main="Heat capacity versus temp",
      xlab="Temp (in Kelvin)", ylab="Heat capacity (cal/(mol.K))", las=1)

Example: heatcap

hc_quad <- lm( Cp ~ Temp + I(Temp^2), data=heatcap )
summary(hc_quad, correlation=TRUE)$correlation
            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000
hc_mod1 <- lm( Cp ~ poly(Temp, 1), data=heatcap)   # Linear
hc_mod2 <- lm( Cp ~ poly(Temp, 2), data=heatcap)   # Quadratic
hc_mod3 <- lm( Cp ~ poly(Temp, 3), data=heatcap)   # Cubic
hc_mod4 <- lm( Cp ~ poly(Temp, 4), data=heatcap)   # Quartic
zapsmall( summary(hc_mod3, correlation=TRUE)$correlation )
               (Intercept) poly(Temp, 3)1 poly(Temp, 3)2 poly(Temp, 3)3
(Intercept)              1              0              0              0
poly(Temp, 3)1           0              1              0              0
poly(Temp, 3)2           0              0              1              0
poly(Temp, 3)3           0              0              0              1
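The zero correlations are by construction: poly() builds orthogonal polynomial columns, so the coefficient estimates are uncorrelated. A quick check (a minimal sketch) confirms the basis columns are orthonormal:

library(GLMsData); data(heatcap)
X <- poly(heatcap$Temp, 3)    # orthogonal polynomial basis
zapsmall( crossprod(X) )      # t(X) %*% X is the identity matrix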

Regression Splines

The idea

A spline represents the relationship between \(y\) and \(x\) as a series of polynomials joined together at locations called knots, satisfying:

  • the transition across each knot is continuous
  • the first and second derivatives are also continuous at all knots
  • The number and degree of the polynomials can be chosen by the user (see the sketch after this list).
  • Each polynomial is fitted to a subset of the observations (those between adjacent knots).
  • Fewer polynomials mean a smoother curve and a simpler model.
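A minimal sketch of choosing knots by hand with bs() (the knot locations here are illustrative, not from the lecture):

library(splines); library(GLMsData); data(heatcap)
knots <- quantile( heatcap$Temp, probs=c(1/3, 2/3) )   # two interior knots
lm.knots <- lm( Cp ~ bs(Temp, knots=knots, degree=3), data=heatcap )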

Typical approach

  • Specify a convenient number of knots
  • Fit the spline curve to the data by least squares (this approach is called regression splines)
  • The number of regression coefficients used to fit a regression spline is known as the degrees of freedom (df) of the curve; the higher the df, the more complex the trend the curve can follow (see the sketch below).
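For a cubic B-spline, the df equals the number of interior knots plus the polynomial degree, so specifying df fixes how many interior knots R places (at quantiles of \(x\)). A minimal sketch:

library(splines); library(GLMsData); data(heatcap)
# bs() with df=5 and degree=3 places 5 - 3 = 2 interior knots:
attr( bs(heatcap$Temp, df=5), "knots" )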

In R

Using the splines package, splines may be fitted with

  • ns(): the natural cubic spline, with second derivatives forced to zero at the endpoints of the given interval
  • bs(): a B(asis)-spline

Three approaches

library(splines); library(GLMsData); data(heatcap)

lm.poly <- lm( Cp ~ poly(Temp, 3),  data=heatcap )
lm.ns   <- lm( Cp ~ ns(Temp, df=3), data=heatcap )
lm.bs   <- lm( Cp ~ bs(Temp, df=3), data=heatcap )

# Comparison:
extractAIC(lm.poly); extractAIC(lm.ns); extractAIC(lm.bs)
[1]    4.0000 -117.1234
[1]    4.0000 -119.2705
[1]    4.0000 -117.1234
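With df=3, bs() places no interior knots, so the cubic B-spline basis spans the same space as the cubic polynomial; that is why lm.poly and lm.bs have identical AICs. The natural spline's endpoint constraints free up df for interior knots, which gives a slightly lower AIC here. To compare the fitted curves visually (a minimal sketch, reusing the fits above):

newT <- data.frame( Temp=seq(min(heatcap$Temp), max(heatcap$Temp), length=100) )
plot( Cp ~ Temp, data=heatcap, las=1 )
lines( newT$Temp, predict(lm.poly, newdata=newT), lty=1 )
lines( newT$Temp, predict(lm.ns,   newdata=newT), lty=2 )
lines( newT$Temp, predict(lm.bs,   newdata=newT), lty=3 )
legend( "topleft", lty=1:3, legend=c("poly", "ns", "bs") )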

Fixing Outliers

Possible Conclusions

  • The observation is a known mistake, due to a misapplication of treatment, misrecorded data, etc.
  • The observation is known to come from a different population.
  • We don’t know why the observation is an outlier.

Mistakes

  • If possible, correct the mistake!
  • Otherwise, discard the observation.
  • Discarding assumes that the occurrence of the mistake did not depend on the value of the observation.

Different Population

  • Such observations can often be discarded.
  • Reporting should indicate the population to which the model applies.
  • If more than one or two observations come from the different population, perhaps the model can be modified to accommodate both populations.

Unknown

Dilemma:

  • Discarding the observation is often unwise.
  • A different or more complex model may be necessary.
  • Strategy: fit the model with and without the outlier and compare (see the sketch after this list).
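A minimal sketch of the with/without strategy, using the heatcap data and a hypothetical suspected outlier at observation 10 (the index is illustrative only):

library(GLMsData); data(heatcap)
fit.all  <- lm( Cp ~ poly(Temp, 3), data=heatcap )
fit.drop <- lm( Cp ~ poly(Temp, 3), data=heatcap, subset=-10 )  # drop obs 10
cbind( All=coef(fit.all), Dropped=coef(fit.drop) )   # compare coefficients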

Collinearity

Definition

  • Collinearity occurs when some of the covariates are highly correlated with one another
  • Implies that they measure almost the same information
  • Different combinations of the covariates may lead to nearly the same fitted values
  • Mainly a problem for interpretation
  • Most easily identified by examining correlations between covariates (see the sketch below)
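A minimal sketch with simulated covariates (not from the lecture), where x2 is nearly a copy of x1:

set.seed(1)
x1 <- runif(50)
x2 <- x1 + rnorm(50, sd=0.01)   # x2 measures almost the same thing as x1
cor( cbind(x1, x2) )            # pairwise correlations near 1 flag collinearity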

Consequences of collinearity

  • Standard errors of affected regression coefficients become large (see the sketch after this list).
  • Typically only one of the correlated covariates needs to be retained in the model.
  • From a statistical perspective, it matters little which one is retained.
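Continuing the simulated sketch above: a response that depends only on x1, fitted with and without the near-copy x2, shows how collinearity inflates the standard errors:

set.seed(1)
x1 <- runif(50); x2 <- x1 + rnorm(50, sd=0.01)
y  <- 1 + 2*x1 + rnorm(50, sd=0.2)
coef( summary( lm(y ~ x1 + x2) ) )[, "Std. Error"]   # inflated for x1 and x2
coef( summary( lm(y ~ x1) ) )[, "Std. Error"]        # much smaller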

Possible remedies

  • Omit some explanatory variables
  • Combine explanatory variables (provided the combination makes sense)
  • Collect more data
  • Use special methods, such as ridge regression