Polynomial Trends, Regression Splines, and Fixing Problems

Stat 203 Lecture 14

Dr. Janssen

Cautions

  • When we want to include higher powers of the covariate \(x\), the coefficients may be highly correlated (see the sketch after this list).
  • Different combinations of correlated covariates may lead to nearly the same fitted values, creating difficulties in interpretation.
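A minimal sketch of the problem, using a simulated covariate (not from the lecture): \(x\) and \(x^2\) are highly correlated, though centering \(x\) before squaring removes most of that correlation. The lecture's own remedy, orthogonal polynomials via poly(), appears below.

x <- 1:20                            # simulated covariate
cor( x, x^2 )                        # nearly 1
xc <- x - mean(x)                    # centre the covariate first
cor( xc, xc^2 )                      # essentially zero here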

Example: heatcap data

library(GLMsData); data(heatcap)
plot( Cp ~ Temp, data=heatcap, main="Heat capacity versus temp",
      xlab="Temp (in Kelvin)", ylab="Heat capacity (cal/(mol.K))", las=1)

Example: heatcap

hc_quad <- lm( Cp ~ Temp + I(Temp^2), data=heatcap )
summary(hc_quad, correlation=TRUE)$correlation
            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000
hc_mod1 <- lm( Cp ~ poly(Temp, 1), data=heatcap)   # Linear
hc_mod2 <- lm( Cp ~ poly(Temp, 2), data=heatcap)   # Quadratic
hc_mod3 <- lm( Cp ~ poly(Temp, 3), data=heatcap)   # Cubic
hc_mod4 <- lm( Cp ~ poly(Temp, 4), data=heatcap)   # Quartic
zapsmall( summary(hc_mod3, correlation=TRUE)$correlation )
               (Intercept) poly(Temp, 3)1 poly(Temp, 3)2 poly(Temp, 3)3
(Intercept)              1              0              0              0
poly(Temp, 3)1           0              1              0              0
poly(Temp, 3)2           0              0              1              0
poly(Temp, 3)3           0              0              0              1
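The zero correlations are by construction: poly() builds orthogonal polynomial columns, so the coefficient estimates are uncorrelated. A quick check (a minimal sketch) confirms the basis columns are orthonormal:

library(GLMsData); data(heatcap)
X <- poly(heatcap$Temp, 3)    # orthogonal polynomial basis
zapsmall( crossprod(X) )      # t(X) %*% X is the identity matrix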

Regression Splines

The idea

A spline represents the relationship between \(y\) and \(x\) as a series of polynomials joined together at locations called knots, satisfying:

  • the transition across each knot is continuous
  • the first and second derivatives are also continuous at all knots
  • The number and degree of the polynomials can be chosen by the user (see the sketch after this list).
  • Each polynomial is fitted to a subset of the observations (those between adjacent knots).
  • Fewer polynomials mean a smoother curve and a simpler model.
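A minimal sketch of choosing knots by hand with bs() (the knot locations here are illustrative, not from the lecture):

library(splines); library(GLMsData); data(heatcap)
knots <- quantile( heatcap$Temp, probs=c(1/3, 2/3) )   # two interior knots
lm.knots <- lm( Cp ~ bs(Temp, knots=knots, degree=3), data=heatcap )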

Typical approach

  • Specify a convenient number of knots
  • Fit the spline curve to the data by least squares (this approach is called regression splines)
  • The number of regression coefficients used to fit a regression spline is known as the degrees of freedom (df) of the curve; the higher the df, the more complex the trend the curve can follow (see the sketch below).
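For a cubic B-spline, the df equals the number of interior knots plus the polynomial degree, so specifying df fixes how many interior knots R places (at quantiles of \(x\)). A minimal sketch:

library(splines); library(GLMsData); data(heatcap)
# bs() with df=5 and degree=3 places 5 - 3 = 2 interior knots:
attr( bs(heatcap$Temp, df=5), "knots" )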

In R

Using the splines package, splines may be fitted with

  • ns(): the natural cubic spline, with second derivatives forced to zero at the endpoints of the given interval
  • bs(): a B(asis)-spline

Three approaches

library(splines); library(GLMsData); data(heatcap)

lm.poly <- lm( Cp ~ poly(Temp, 3),  data=heatcap )
lm.ns   <- lm( Cp ~ ns(Temp, df=3), data=heatcap )
lm.bs   <- lm( Cp ~ bs(Temp, df=3), data=heatcap )

# Comparison:
extractAIC(lm.poly); extractAIC(lm.ns); extractAIC(lm.bs)
[1]    4.0000 -117.1234
[1]    4.0000 -119.2705
[1]    4.0000 -117.1234
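With df=3, bs() places no interior knots, so the cubic B-spline basis spans the same space as the cubic polynomial; that is why lm.poly and lm.bs have identical AICs. The natural spline's endpoint constraints free up df for interior knots, which gives a slightly lower AIC here. To compare the fitted curves visually (a minimal sketch, reusing the fits above):

newT <- data.frame( Temp=seq(min(heatcap$Temp), max(heatcap$Temp), length=100) )
plot( Cp ~ Temp, data=heatcap, las=1 )
lines( newT$Temp, predict(lm.poly, newdata=newT), lty=1 )
lines( newT$Temp, predict(lm.ns,   newdata=newT), lty=2 )
lines( newT$Temp, predict(lm.bs,   newdata=newT), lty=3 )
legend( "topleft", lty=1:3, legend=c("poly", "ns", "bs") )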

Fixing Outliers

Possible Conclusions

  • The observation is a known mistake, due to a misapplication of treatment, misrecorded data, etc.
  • The observation is known to come from a different population.
  • We don’t know why the observation is an outlier.

Mistakes

  • If possible, correct the mistake!
  • Otherwise, discard the observation.
  • Discarding assumes that the occurrence of the mistake did not depend on the value of the observation.

Different Population

  • Such observations can often be discarded.
  • Reporting should indicate the population to which the model applies.
  • If more than one or two observations come from the different population, perhaps the model can be modified to accommodate both populations.

Unknown

Dilemma:

  • Discarding the observation is often unwise.
  • A different or more complex model may be necessary.
  • Strategy: fit the model with and without the outlier and compare (see the sketch after this list).
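A minimal sketch of the with/without strategy, using the heatcap data and a hypothetical suspected outlier at observation 10 (the index is illustrative only):

library(GLMsData); data(heatcap)
fit.all  <- lm( Cp ~ poly(Temp, 3), data=heatcap )
fit.drop <- lm( Cp ~ poly(Temp, 3), data=heatcap, subset=-10 )  # drop obs 10
cbind( All=coef(fit.all), Dropped=coef(fit.drop) )   # compare coefficients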

Collinearity

Definition

  • Collinearity occurs when some of the covariates are highly correlated with one another
  • Implies that they measure almost the same information
  • Different combinations of the covariates may lead to nearly the same fitted values
  • Mainly a problem for interpretation
  • Most easily identified by examining correlations between covariates (see the sketch below)
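A minimal sketch with simulated covariates (not from the lecture), where x2 is nearly a copy of x1:

set.seed(1)
x1 <- runif(50)
x2 <- x1 + rnorm(50, sd=0.01)   # x2 measures almost the same thing as x1
cor( cbind(x1, x2) )            # pairwise correlations near 1 flag collinearity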

Consequences of collinearity

  • Standard errors of affected regression coefficients become large (see the sketch after this list).
  • Typically only one of the correlated covariates needs to be retained in the model.
  • From a statistical perspective, it matters little which one is retained.
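Continuing the simulated sketch above: a response that depends only on x1, fitted with and without the near-copy x2, shows how collinearity inflates the standard errors:

set.seed(1)
x1 <- runif(50); x2 <- x1 + rnorm(50, sd=0.01)
y  <- 1 + 2*x1 + rnorm(50, sd=0.2)
coef( summary( lm(y ~ x1 + x2) ) )[, "Std. Error"]   # inflated for x1 and x2
coef( summary( lm(y ~ x1) ) )[, "Std. Error"]        # much smaller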

Possible remedies

  • Omit some explanatory variables
  • Combine explanatory variables (provided the combination makes sense)
  • Collect more data
  • Use special methods, such as ridge regression