Transforming Covariates

Stat 203 Lecture 13

Dr. Janssen

Transforming Covariates

Transforming Factors

Question

Why does it not make sense to transform factors?

Transforming Covariates

There are at least a couple of reasons we may consider transforming covariates:

Perhaps the trend in the covariates is not particularly linear. A transformation may enable us to achieve linearity.
There may be some influential observations in the data. Transforming the covariates may reduce their influence.

When transforming the covariates, we still produce a model linear in the parameters, and we may transform any or all of the covariates as appropriate to the situation.

Example: `windmill`

Example 1 In this example, we explore a dataset relating wind velocity and corresponding direct current output from windmills. We first import the data.

library(GLMsData); data(windmill); names(windmill)

[1] "Wind" "DC"

Example: `windmill`

Example 2 In this example, we explore a dataset relating wind velocity and corresponding direct current output from windmills. We first import the data.

library(GLMsData); data(windmill); names(windmill)

[1] "Wind" "DC"

scatter.smooth(windmill$DC ~ windmill$Wind, main="No transforms", xlab="Wind speed", ylab="DC output", las=1)

Explore. How can you adjust Wind to produce something more linear?

Fixing `windmill`

Code

scatter.smooth(windmill$DC ~ (1/windmill$Wind), main="1/Wind", xlab="1/(Wind speed)", ylab="DC output", las=1)

Warning

To tell R to interpret 1/Wind arithmetically rather than as a formula in the lm() function, we need to insulate it using the I() function:

wind_model <- lm(DC ~ I(1/Wind), data=windmill)
summary(wind_model)


Call:
lm(formula = DC ~ I(1/Wind), data = windmill)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.20547 -0.04940  0.01100  0.08352  0.12204 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.9789     0.0449   66.34   <2e-16 ***
I(1/Wind)    -6.9345     0.2064  -33.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09417 on 23 degrees of freedom
Multiple R-squared:   0.98, Adjusted R-squared:  0.9792 
F-statistic:  1128 on 1 and 23 DF,  p-value: < 2.2e-16

Example: Simultaneous log transforms

Consider the trees dataset.

Code

data(trees)
plot(log(Volume)~log(Height), data=trees, pch=19, xlab="log(Height)", ylab="log(Volume)", main="Log(Volume) vs log(Height)", las=1)

Code

plot(log(Volume)~log(Girth), data=trees, pch=19, xlab="log(Girth)", ylab="log(Volume)", main="Log(Volume) vs log(Girth)", las=1)

Code

tree_model <- lm( log(Volume) ~ log(Girth) + log(Height), data=trees)
printCoefmat( coef(summary(tree_model)))

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -6.631617   0.799790 -8.2917 5.057e-09 ***
log(Girth)   1.982650   0.075011 26.4316 < 2.2e-16 ***
log(Height)  1.117123   0.204437  5.4644 7.805e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sometimes it makes sense to take the log of both the response \(y\) and the covariate(s) \(x\). This is particularly the case when \(y\) is proportional to a power of \(x\); if \(E[y] = \alpha x^\beta\), then \(E[\log y] \approx \log(\alpha) + \beta \log(x)\).

This can often happen when the units of the response are a power of the units of the covariates, as in the trees datasets, in which Volume (in cubic feet) is given, as are height (in feet) and Girth (DBH) (in inches). Note that the formula for the volume of a cylindrical black cherry tree is

\[ y = \frac{\pi (d/12)^2 h}{4}. \]

Thus, \(\log y = \log(\pi/576) + 2\log d + \log h\).

We note that the coefficient on \(\log(d)\) is approx 2, and the coefficient on \(\log(h)\) is approx 1. The intercept is not approximately \(\log(\pi/576) \approx -5.211\), though; it is approximately \(\log(\pi/1728)\), which might come from the conical approximation: \(y = \frac{\pi (d/12)^2 h}{12}\).

Polynomial Trends

Cautions

When we want to include higher powers of the covariate \(x\), the coefficients may be highly correlated.
Different combinations of correlated covariates may lead to nearly the same fitted values, creating difficulties in interpretation.

Example: `heatcap` data

data(heatcap)
plot (Cp ~ Temp, data=heatcap, main="Heat capacity versus temp", xlab="Temp (in Kelvin)", ylab="Heat capacity (cal/(mol.K))", las=1)

Example: `heatcap`

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation

Example: `heatcap`

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation

            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000

Example: `heatcap`

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation

            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000

hc_mod1 <- lm( Cp ~ poly(Temp, 1), data=heatcap)   # Linear
hc_mod2 <- lm( Cp ~ poly(Temp, 2), data=heatcap)   # Quadratic
hc_mod3 <- lm( Cp ~ poly(Temp, 3), data=heatcap)   # Cubic
hc_mod4 <- lm( Cp ~ poly(Temp, 4), data=heatcap)   # Quartic

Example: `heatcap`

hc_quad <- lm(Cp ~ Temp + I(Temp^2),data=heatcap)
summary(hc_quad, correlation=TRUE)$correlation

            (Intercept)       Temp  I(Temp^2)
(Intercept)   1.0000000 -0.9984975  0.9941781
Temp         -0.9984975  1.0000000 -0.9985344
I(Temp^2)     0.9941781 -0.9985344  1.0000000

hc_mod1 <- lm( Cp ~ poly(Temp, 1), data=heatcap)   # Linear
hc_mod2 <- lm( Cp ~ poly(Temp, 2), data=heatcap)   # Quadratic
hc_mod3 <- lm( Cp ~ poly(Temp, 3), data=heatcap)   # Cubic
hc_mod4 <- lm( Cp ~ poly(Temp, 4), data=heatcap)   # Quartic
zapsmall( summary(hc_mod3,correlation=TRUE)$correlation )

               (Intercept) poly(Temp, 3)1 poly(Temp, 3)2 poly(Temp, 3)3
(Intercept)              1              0              0              0
poly(Temp, 3)1           0              1              0              0
poly(Temp, 3)2           0              0              1              0
poly(Temp, 3)3           0              0              0              1

Transforming Covariates

Transforming Covariates

Transforming Factors

Transforming Covariates

Example: windmill

Example: windmill

Fixing windmill

Example: Simultaneous log transforms

Polynomial Trends

Cautions

Example: heatcap data

Example: heatcap

Example: heatcap

Example: heatcap

Example: heatcap

Example: `windmill`

Example: `windmill`

Fixing `windmill`

Example: `heatcap` data

Example: `heatcap`

Example: `heatcap`

Example: `heatcap`

Example: `heatcap`