Transforming Covariates

Stat 203 Lecture 13

Dr. Janssen

Transforming Covariates

Transforming Factors

Question

Why does it not make sense to transform factors?

Transforming Covariates

There are at least a couple of reasons we may consider transforming covariates:

  • Perhaps the trend in the covariates is not particularly linear. A transformation may enable us to achieve linearity.
  • There may be some influential observations in the data. Transforming the covariates may reduce their influence.

When transforming the covariates, we still produce a model linear in the parameters, and we may transform any or all of the covariates as appropriate to the situation.

Example: windmill

Example 1 In this example, we explore a dataset relating wind velocity and corresponding direct current output from windmills. We first import the data.

library(GLMsData); data(windmill); names(windmill)
[1] "Wind" "DC"  

Example: windmill

Example 2 In this example, we explore a dataset relating wind velocity and corresponding direct current output from windmills. We first import the data.

library(GLMsData); data(windmill); names(windmill)
[1] "Wind" "DC"  
scatter.smooth(windmill$DC ~ windmill$Wind, main="No transforms", xlab="Wind speed", ylab="DC output", las=1)

Explore. How can you adjust Wind to produce something more linear?

Fixing windmill

Code
scatter.smooth(windmill$DC ~ (1/windmill$Wind), main="1/Wind", xlab="1/(Wind speed)", ylab="DC output", las=1)

Warning

To tell R to interpret 1/Wind arithmetically rather than as a formula in the lm() function, we need to insulate it using the I() function:

wind_model <- lm(DC ~ I(1/Wind), data=windmill)
summary(wind_model)

Call:
lm(formula = DC ~ I(1/Wind), data = windmill)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.20547 -0.04940  0.01100  0.08352  0.12204 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.9789     0.0449   66.34   <2e-16 ***
I(1/Wind)    -6.9345     0.2064  -33.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09417 on 23 degrees of freedom
Multiple R-squared:   0.98, Adjusted R-squared:  0.9792 
F-statistic:  1128 on 1 and 23 DF,  p-value: < 2.2e-16

Example: Simultaneous log transforms

Consider the trees dataset.

Code
data(trees)
plot(log(Volume)~log(Height), data=trees, pch=19, xlab="log(Height)", ylab="log(Volume)", main="Log(Volume) vs log(Height)", las=1)

Code
plot(log(Volume)~log(Girth), data=trees, pch=19, xlab="log(Girth)", ylab="log(Volume)", main="Log(Volume) vs log(Girth)", las=1)

Code
tree_model <- lm( log(Volume) ~ log(Girth) + log(Height), data=trees)
printCoefmat( coef(summary(tree_model)))
             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -6.631617   0.799790 -8.2917 5.057e-09 ***
log(Girth)   1.982650   0.075011 26.4316 < 2.2e-16 ***
log(Height)  1.117123   0.204437  5.4644 7.805e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Cautions

  • When we want to include higher powers of the covariate \(x\), the coefficients may be highly correlated.
  • Different combinations of correlated covariates may lead to nearly the same fitted values, creating difficulties in interpretation.

Example: heatcap data

data(heatcap)
plot (Cp ~ Temp, data=heatcap, main="Heat capacity versus temp", xlab="Temp (in Kelvin)", ylab="Heat capacity (cal/(mol.K))", las=1)

Example: heatcap

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation

Example: heatcap

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation
            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000

Example: heatcap

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation
            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000
hc_mod1 <- lm( Cp ~ poly(Temp, 1), data=heatcap)   # Linear
hc_mod2 <- lm( Cp ~ poly(Temp, 2), data=heatcap)   # Quadratic
hc_mod3 <- lm( Cp ~ poly(Temp, 3), data=heatcap)   # Cubic
hc_mod4 <- lm( Cp ~ poly(Temp, 4), data=heatcap)   # Quartic

Example: heatcap

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation
            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000
hc_mod1 <- lm( Cp ~ poly(Temp, 1), data=heatcap)   # Linear
hc_mod2 <- lm( Cp ~ poly(Temp, 2), data=heatcap)   # Quadratic
hc_mod3 <- lm( Cp ~ poly(Temp, 3), data=heatcap)   # Cubic
hc_mod4 <- lm( Cp ~ poly(Temp, 4), data=heatcap)   # Quartic
zapsmall( summary(hc_mod3,correlation=TRUE)$correlation )
               (Intercept) poly(Temp, 3)1 poly(Temp, 3)2 poly(Temp, 3)3
(Intercept)              1              0              0              0
poly(Temp, 3)1           0              1              0              0
poly(Temp, 3)2           0              0              1              0
poly(Temp, 3)3           0              0              0              1