Fitting Linear Regression Using R

Stat 203 Lecture 7

Author

Dr. Janssen

Syntax

The goal is to understand how to create linear regression models in R, extract the estimated coefficients, and use the model to plot lines of best fit.

General syntax

library(GLMsData); data(gestation);
gest.wtd <- lm(Weight ~ Age, data=gestation, weights=Births)
summary(gest.wtd)

Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-1.62979 -0.60893 -0.30063 -0.08845  1.03880 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.678389   0.371172  -7.216 7.49e-07 ***
Age          0.153759   0.009493  16.197 1.42e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7753 on 19 degrees of freedom
Multiple R-squared:  0.9325,    Adjusted R-squared:  0.9289 
F-statistic: 262.3 on 1 and 19 DF,  p-value: 1.416e-12

Weighted vs Unweighted

library(GLMsData); data(gestation);
gest.wtd <- lm(Weight ~ Age, data=gestation, weights=Births)
gest.ord <- lm(Weight ~ Age, data=gestation)
coef(gest.wtd)
(Intercept)         Age 
 -2.6783891   0.1537594 
coef(gest.ord)
(Intercept)         Age 
  -3.049879    0.159483 

Plotting

plot( Weight ~ Age, data=gestation, type="n",
    las=1, xlim=c(20, 45), ylim=c(0, 4),
    xlab="Gestational age (weeks)", ylab="Mean birthweight (in kg)" )
points( Weight[Births< 20] ~ Age[Births< 20], pch=1,  data=gestation )
points( Weight[Births>=20] ~ Age[Births>=20], pch=19, data=gestation )
abline( coef(gest.ord), lty=2, lwd=2)
abline( coef(gest.wtd), lty=1, lwd=2)
legend("topleft", lwd=c(2, 2), bty="n",
    lty=c(2, 1, NA, NA), pch=c(NA, NA, 1, 19),   # NA shows nothing
    legend=c("Ordinary regression", "Weighted regression","Based on fewer than 20 obs.","Based on 20 or more obs."))

Accessing components of the model

We can access model components using built-in R functions, summarized on pp. 79-81 of the text.

gest.wtd$coefficients
(Intercept)         Age 
 -2.6783891   0.1537594 
coef(gest.wtd)
(Intercept)         Age 
 -2.6783891   0.1537594 
summary(gest.wtd)$sigma
[1] 0.7752701
names(summary(gest.wtd))
 [1] "call"          "terms"         "weights"       "residuals"    
 [5] "coefficients"  "aliased"       "sigma"         "df"           
 [9] "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled" 

Interpreting the Regression Coefficients

Review

Given a simple linear regression model with systematic component

\[ \mu_i = \beta_0 + \beta_1 x_i, \] how did we interpret a one-unit increase in \(x_i\)?

Inference for Linear Regression Models

Normal Linear Regression Models

  • Up to now, no specific distribution has been assumed for the responses.
  • For formal inference, we need to be specific.
  • Typical assumption: normally distributed responses, with constant variance (unweighted) or with variance inversely proportional to known weights:

\[ \begin{cases} y_i &\sim N(\mu_i, \sigma^2/w_i)\\ \mu_i &= \beta_0 + \sum\limits_{j=1}^p \beta_j x_{ji} \end{cases} \tag{1}\]

The distribution of \(\hat{\beta}_j\)

\[ \hat{\beta}_j \sim N(\beta_j, \text{var}[\hat{\beta}_j]) \tag{2}\]

From Equation 2,

\[ Z = \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}. \tag{3}\]

Earlier we gave expressions for estimating \(\text{var}[\hat{\beta}_j]\). In simple linear regression, we had

\[ \text{var}[\hat\beta_0] = \sigma^2 \left( \frac{1}{\sum w_i} + \frac{\overline{x}^2_w}{\text{SS}_x}\right) \qquad \text{and} \qquad \text{var}[\hat\beta_1] = \frac{\sigma^2}{\text{SS}_x} \]

When we adopt a normal linear regression model, the entire distributions of the \(\hat{\beta}_j\)’s are known, not just the variances. Under our new model, \(\hat{\beta}_j\) is a random variable that follows a normal distribution, since \(\hat{\beta}_j\) is a linear combination of the \(y_i\). Specifically, we have Equation 2.

So \(\hat{\beta}_j\) has a normal distribution with mean \(\beta_j\) and variance \(\text{var}[\hat\beta_j]\). Recall that \(\text{var}[\hat{\beta}_j]\) is a product of \(\sigma^2\) and the known values of the explanatory variables and weights, so the standard error \(\text{se}(\hat\beta_j)\) shrinks roughly in proportion to \(1/\sqrt{n}\).
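
For intuition, here is a small simulation (not from the text; the values of \(\beta_0\), \(\beta_1\), and \(\sigma\) are made up) illustrating Equation 2 for the slope of a simple linear regression with unit weights: slope estimates from repeated samples are approximately normal, centered at \(\beta_1\), with standard deviation \(\sigma/\sqrt{\text{SS}_x}\).

set.seed(1)
x <- 1:30
beta0 <- -2; beta1 <- 0.15; sigma <- 0.5   # assumed "true" values for the simulation
b1 <- replicate(2000, {
  y <- beta0 + beta1*x + rnorm(length(x), sd=sigma)
  coef(lm(y ~ x))[2]                       # slope estimate from one simulated sample
})
c(mean(b1), sd(b1))                        # compare to beta1 and sigma/sqrt(SSx)
sigma/sqrt(sum((x - mean(x))^2))           # theoretical standard deviation of the slope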

In Equation 3, the SE is the square root of the variance, and \(Z\) is standard normal when \(\sigma^2\) is known. When \(\sigma^2\) is unknown, we estimate \(\sigma^2\) by \(s^2\) and hence estimate \(\text{var}[\hat{\beta}_j]\).

Then

\[ T = \frac{\hat\beta_j - \beta_j}{\text{se}(\hat\beta_j)} \tag{4}\]

has a Student’s \(t\) distribution with \(n - p'\) df, which converges to the standard normal as the degrees of freedom increase (e.g., through an increased sample size).

Hypothesis tests for \(\beta_j\)

We use the statistic

\[ T = \frac{\hat{\beta}_j - \beta_j^0}{\text{se}(\hat{\beta}_j)} \tag{5}\]

to test the null hypothesis \(H_0 : \beta_j = \beta_j^0\) against an alternative (one-sided or two-sided).

When \(H_0\) is true, \(T\) has a \(t\)-distribution with \(n-p'\) df when \(\sigma^2\) is unknown, so we determine significance by referring to this distribution. Each individual \(t\)-test determines whether evidence exists that the parameter is statistically significantly different from \(\beta_j^0\) in the presence of the other variables currently in the model.

Example 1 We can use coef() on the summary() of an lm() object to get the following table.

data(lungcap)
lungcap$Smoke <- factor(lungcap$Smoke, levels=c(0, 1),
                     labels=c("Non-smoker","Smoker"))
LC.m1 <- lm(log(FEV) ~ Age + Ht + Gender + Smoke,data=lungcap)
round(coef(summary(LC.m1)),5)
            Estimate Std. Error   t value Pr(>|t|)
(Intercept) -1.94400    0.07864 -24.72067  0.00000
Age          0.02339    0.00335   6.98449  0.00000
Ht           0.04280    0.00168  25.48933  0.00000
GenderM      0.02932    0.01172   2.50196  0.01260
SmokeSmoker -0.04607    0.02091  -2.20311  0.02794

Here:

  • The Estimate column contains the parameter estimates \(\hat{\beta}_j\)
  • The Std. Error column contains the corresponding standard errors \(\text{se}(\hat{\beta}_j)\)
  • The t value column contains the corresponding \(t\)-statistic (Equation 5) for testing \(H_0\)
  • The Pr(>|t|) column contains the corresponding two-tailed \(p\)-values for the hypothesis tests
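
These columns can be reproduced by hand. For example, here is a sketch for the Age row, using only the fitted model LC.m1:

est <- coef(summary(LC.m1))["Age", "Estimate"]
se  <- coef(summary(LC.m1))["Age", "Std. Error"]
est/se                                     # the t value column
2*pt(-abs(est/se), df=df.residual(LC.m1))  # the two-tailed Pr(>|t|) column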

Confidence intervals for \(\beta_j\)

Confidence intervals give us a sense of the size of an observed effect. You can read about the formula on p. 55 of the text, or use confint() on an lm() object. By default, R produces 95% confidence intervals; you can set, e.g., level=0.90 to calculate 90% confidence intervals instead.
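
For instance, applied to the lung capacity model LC.m1 from Example 1 (a sketch):

confint(LC.m1)               # 95% confidence intervals (the default)
confint(LC.m1, level=0.90)   # 90% confidence intervals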

Confidence intervals for \(\mu\)

Computing confidence intervals for \(\mu\) is a bit different, as the fitted values \(\hat{\mu}\) depend in turn on the estimates \(\hat{\beta}_j\). In this case, the \(100(1-\alpha)\)% confidence interval for the fitted value is

\[ \hat{\mu}_g \pm t^*_{\alpha/2,n-p'}\,\text{se}(\hat\mu_g), \tag{6}\]

where \(\text{se}(\hat\mu_g) = \sqrt{\text{var}[\hat\mu_g]}\), and where \(t^*_{\alpha/2,n-p'}\) is the value such that an area \(\alpha/2\) is in each tail of the \(t\)-distribution with \(n-p'\) df.

R gives us the standard errors when we make predictions using predict() with the argument se.fit=TRUE. We can then form the confidence intervals directly.

Example 2 (Example 2.18, p. 56) For the lung capacity data (data set: lungcap), suppose we wish to estimate \(\mu = \text{E}[\log(\text{FEV})]\) for female smokers aged 18 who are 66 inches tall. Using R, we first create a new data frame containing the values of the explanatory variables for which we need to make the prediction:

new.df <- data.frame(Age=18, Ht=66, Gender="F", Smoke="Smoker")

Then, use predict() to compute the estimates of \(\mu\):

out <- predict( LC.m1, newdata=new.df, se.fit=TRUE)
names(out)
[1] "fit"            "se.fit"         "df"             "residual.scale"
out$se.fit
[1] 0.02350644
tstar <- qt(df=LC.m1$df, p=0.975 ) # For a 95% CI
ci.lo <- out$fit - tstar*out$se.fit
ci.hi <- out$fit + tstar*out$se.fit
CIinfo <- cbind( Lower=ci.lo, Estimate=out$fit, Upper=ci.hi)
CIinfo
     Lower Estimate    Upper
1 1.209268 1.255426 1.301584

The prediction is \(\hat\mu = 1.255\) and the 95% confidence interval is from 1.209 to 1.302. An approximate confidence interval for \(\text{E}[\text{FEV}]\) is thus

exp(CIinfo)
     Lower Estimate    Upper
1 3.351032 3.509334 3.675114
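
Note that predict() can also return this interval in one step (a sketch; the result is still on the log(FEV) scale, so it would again be exponentiated to approximate \(\text{E}[\text{FEV}]\)):

predict(LC.m1, newdata=new.df, interval="confidence")   # columns fit, lwr, upr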

Analysis of Variance for Regression Models

Our linear regression models yield fitted values \(\hat{\mu}_i\) for each observation \(y_i\). We can therefore separate each observation into a component predicted by the model (\(\hat{\mu}_i\)) and the residual that is left over (\(y_i - \hat{\mu}_i\)):

\[ y_i = \hat{\mu}_i + (y_i - \hat{\mu}_i). \]

That is, \(\text{Data} = \text{Fit} + \text{Residual}\). To evaluate the contributions of each of the covariates \(x_{ij}\), we explore the mean-corrected data,

\[ y_i - \overline{y}_w = (\hat{\mu}_i - \overline{y}_w) + (y_i - \hat{\mu}_i), \tag{7}\]

where \(\overline{y}_w = \sum_i w_i y_i/ \sum_i w_i\) is the weighted mean of the observations. Squaring each term of Equation 7 and summing over \(i\) yields the identity

\[ \text{SST} = \text{SSREG} + \text{RSS}, \tag{8}\]

where:

  • \(\text{SST} = \sum_{i=1}^n w_i (y_i - \overline{y}_w)^2\) is the total sum of squares
  • \(\text{SSREG} = \sum_{i=1}^n w_i (\hat{\mu}_i - \overline{y}_w)^2\) is the regression sum of squares and
  • \(\text{RSS} = \sum_{i=1}^n w_i (y_i - \hat{\mu}_i)^2\) is the residual sum of squares.

The identity shown in Equation 8 encapsulates the notion that variation comes from two sources: a systematic component (SSREG) and a random component (RSS). This is the basis for analysis of variance, because it describes the ways in which variance in the data arises.
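
We can check the identity in Equation 8 numerically for the weighted gestation fit (a sketch):

w  <- gestation$Births
y  <- gestation$Weight
mu <- fitted(gest.wtd)
ybar.w <- sum(w*y)/sum(w)            # weighted mean of the observations
SST    <- sum(w*(y - ybar.w)^2)
SSREG  <- sum(w*(mu - ybar.w)^2)
RSS    <- sum(w*(y - mu)^2)
c(SST, SSREG + RSS)                  # the two totals agree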

In order to determine whether the explanatory variables are useful predictors of the response, we can calculate the \(F\)-statistic, which tests whether SSREG is larger than would be expected due to random variation, i.e., whether SSREG is large relative to RSS after taking the number of explanatory variables into account. The \(F\)-statistic is given by

\[ F = \frac{\text{SSREG}/(p'-1)}{\text{RSS}/(n-p')}. \]

A large value for \(F\) means that the proportion of the variation explained by the systematic component is large relative to \(s^2\), the unbiased estimator of \(\sigma^2\); a small value for \(F\) means that the proportion of the variation explained by the systematic component is small relative to \(s^2\).

The summary() function does not show the ANOVA table, though it does show the \(F\)-statistic. You can access the ANOVA table by applying anova() to your model, as in the sketch below.
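
Continuing with the weighted gestation fit (a sketch):

anova(gest.wtd)               # ANOVA table: regression and residual sums of squares
summary(gest.wtd)$fstatistic  # the F-statistic and its degrees of freedom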

Coefficient of Determination

The proportion of the total variation explained by the regression is the coefficient of determination,

\[ R^2 = \frac{\text{SSREG}}{\text{SST}} = 1 - \frac{\text{RSS}}{\text{SST}}. \]

Note that by definition \(0\le R^2 \le 1\).

The main downside of \(R^2\) is that it never decreases when an explanatory variable is added to the model, even if the new variable has little or no explanatory power. Thus, we may adjust \(R^2\) as follows:

\[ \overline{R}^2 = 1 - \frac{\text{RSS}/(n-p')}{\text{SST}/(n-1)} = 1 - (1-R^2) \frac{n-1}{n-p'}. \tag{9}\]

Note that this is the value that R reports as Adjusted R-squared in the model summary().
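
Both quantities can be extracted directly from the summary (a sketch, again for the weighted gestation fit):

summary(gest.wtd)$r.squared       # R^2 = SSREG/SST
summary(gest.wtd)$adj.r.squared   # adjusted R^2 from Equation 9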

Question: Why does the quantity described in Equation 9 fix the problem of \(R^2\) increasing when new explanatory variables are added?

Exploring the Term dataset

In what follows, we’ll explore the Term dataset. About this dataset (source):

Like all firms, life insurance companies continually seek new ways to deliver products to the market. Those involved in product development wish to know “who buys insurance and how much do they buy?” Analysts can readily get information on characteristics of current customers through company databases. Potential customers, those that do not have insurance with the company, are often the main focus for expanding market share. We examine the Survey of Consumer Finances (SCF), a nationally representative sample that contains extensive information on assets, liabilities, income, and demographic characteristics of those sampled (potential U.S. customers). We study a random sample of 500 households with positive incomes that were interviewed in the 2004 survey. For term life insurance, the quantity of insurance is measured by the policy FACE, the amount that the company will pay in the event of the death of the named insured.

Term <- read.csv("https://prof.mkjanssen.org/glm/data/TermLife.csv", header=TRUE) # import the data
#  PICK THE SUBSET OF THE DATA CORRESPONDING TO TERM PURCHASE
Term2 <- subset(Term, subset=FACE > 0)

The variables are:

  • GENDER: The gender of the survey respondent
  • AGE: Age of the survey respondent
  • MARSTAT: Marital status of the survey respondent (=1 if married, =2 if living with partner, and =0 otherwise)
  • EDUCATION: Number of years of education of the survey respondent
  • ETHNICITY: Ethnicity
  • SMARSTAT: Marital status of the respondent’s spouse
  • SGENDER: Gender of the respondent’s spouse
  • SAGE: Age of the respondent’s spouse
  • SEDUCATION: Education of the respondent’s spouse
  • NUMHH: Number of household members
  • INCOME: Annual income of the family
  • CHARITY: Charitable contributions
  • FACE: Amount that the company will pay in the event of the death of the named insured
  • FACECVLIFEPOLICIES: Face amount of life insurance policy with a cash value
  • CASHCVLIFEPOLICIES: Cash value of life insurance policy with a cash value
  • BORROWCVLIFEPOL: Amount borrowed on life insurance policy with a cash value
  • NETVALUE: Net amount at risk on life insurance policy with a cash value

Your job: work with up to one other person to develop a multiple linear regression model to predict the amount of coverage a customer may buy. Make sure you do proper exploratory data analysis (you may find you want to do a log-transform on one or more variables!). Identify significance levels for your parameter estimates, as well as confidence intervals for the parameters and fitted value for representative customers. You should also explain the proportion of the variation predicted by your model compared to the variation attributed to the random component.
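
One possible starting point for the exploratory analysis (a sketch only, not a required or complete approach; it uses the variable names listed above, and logs are defined because Term2 keeps only positive FACE values and the surveyed households have positive incomes):

summary(Term2[, c("FACE", "INCOME", "EDUCATION", "NUMHH")])
hist(Term2$FACE)                            # check the shape of the response
hist(log(Term2$FACE))                       # does a log-transform look more symmetric?
plot(log(FACE) ~ log(INCOME), data=Term2)   # one candidate relationship on the log scale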