Fitting Linear Regression Using R
Stat 203 Lecture 7
Syntax
The goal is to understand how to fit linear regression models in R, extract the coefficients, and use the fitted model to plot lines of best fit.
General syntax
library(GLMsData); data(gestation);
gest.wtd <- lm(Weight ~ Age, data=gestation, weights=Births)
summary(gest.wtd)
Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.62979 -0.60893 -0.30063 -0.08845 1.03880
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.678389 0.371172 -7.216 7.49e-07 ***
Age 0.153759 0.009493 16.197 1.42e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7753 on 19 degrees of freedom
Multiple R-squared: 0.9325, Adjusted R-squared: 0.9289
F-statistic: 262.3 on 1 and 19 DF, p-value: 1.416e-12
Weighted vs Unweighted
library(GLMsData); data(gestation);
gest.wtd <- lm(Weight ~ Age, data=gestation, weights=Births)
gest.ord <- lm(Weight ~ Age, data=gestation)
coef(gest.wtd)
(Intercept) Age
-2.6783891 0.1537594
coef(gest.ord)
(Intercept) Age
-3.049879 0.159483
Plotting
plot( Weight ~ Age, data=gestation, type="n",
las=1, xlim=c(20, 45), ylim=c(0, 4),
xlab="Gestational age (weeks)", ylab="Mean birthweight (in kg)" )
points( Weight[Births< 20] ~ Age[Births< 20], pch=1, data=gestation )
points( Weight[Births>=20] ~ Age[Births>=20], pch=19, data=gestation )
abline( coef(gest.ord), lty=2, lwd=2)
abline( coef(gest.wtd), lty=1, lwd=2)
legend("topleft", lwd=c(2, 2), bty="n",
lty=c(2, 1, NA, NA), pch=c(NA, NA, 1, 19), # NA shows nothing
legend=c("Ordinary regression", "Weighted regression","Based on 20 or fewer obs.","Based on more than 20 obs."))
Accessing components of the model
We can access model components using built-in R functions, summarized on pp. 79-81 of the text.
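The code chunk that produced the output below was not captured in these notes; calls along the following lines would generate it (the specific calls are an assumption):
coef(gest.wtd)              # estimated coefficients
gest.wtd$coefficients       # the same coefficients, accessed as a list component
summary(gest.wtd)$sigma     # residual standard error s
names(summary(gest.wtd))    # components stored in the summary object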
(Intercept) Age
-2.6783891 0.1537594
(Intercept) Age
-2.6783891 0.1537594
[1] 0.7752701
[1] "call" "terms" "weights" "residuals"
[5] "coefficients" "aliased" "sigma" "df"
[9] "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"
Interpreting the Regression Coefficients
Review
Given a simple linear regression model with systematic component
\[ \mu_i = \beta_0 + \beta_1 x_i, \] how did we interpret a one-unit increase in \(x_i\)?
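Recall the reasoning: for the systematic component above,
\[ \mu(x+1) - \mu(x) = \left[\beta_0 + \beta_1 (x+1)\right] - \left[\beta_0 + \beta_1 x\right] = \beta_1, \]
so a one-unit increase in \(x_i\) is associated with a change of \(\beta_1\) in the mean response \(\mu_i\).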
Inference for Linear Regression Models
Normal Linear Regression Models
- Up to now, no specific distribution has been assumed for the responses.
- For formal inference, we need to be specific.
- Typical assumption: normally distributed responses with constant variance (unweighted) or variance inversely proportional to known weights:
\[ \begin{cases} y_i &\sim N(\mu_i, \sigma^2/w_i)\\ \mu_i &= \beta_0 + \sum\limits_{j=1}^p \beta_j x_{ji} \end{cases} \tag{1}\]
The distribution of \(\hat{\beta}_j\)
\[ \hat{\beta}_j \sim N(\beta_j, \text{var}[\hat{\beta}_j]) \tag{2}\]
From Equation 2,
\[ Z = \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)}. \tag{3}\]
Earlier we gave expressions for estimating \(\text{var}[\hat{\beta}_j]\). In simple linear regression, we had
\[ \text{var}[\hat\beta_0] = \sigma^2 \left( \frac{1}{\sum w_i} + \frac{\overline{x}^2_w}{\text{SS}_x}\right) \qquad \text{and} \qquad \text{var}[\hat\beta_1] = \frac{\sigma^2}{\text{SS}_x} \]
When we adopt a normal linear regression model, the entire distributions of the \(\hat{\beta}_j\)'s are known, not just the variances. Under our new model, \(\hat{\beta}_j\) is a random variable that follows a normal distribution, since \(\hat{\beta}_j\) is a linear combination of the \(y_i\). Specifically, we have Equation 2.
So \(\hat{\beta}_j\) has a normal distribution with mean \(\beta_j\) and variance \(\text{var}[\hat\beta_j]\). Recall that \(\text{var}[\hat{\beta}_j]\) is a product of \(\sigma^2\) and quantities involving only the known values of the explanatory variables and weights; since \(\text{SS}_x\) grows roughly in proportion to \(n\), \(\text{se}(\hat\beta_j)\) shrinks at roughly the rate \(1/\sqrt{n}\).
In Equation 3, the SE is the square root of the variance, and \(Z\) is standard normal when \(\sigma^2\) is known. When \(\sigma^2\) is unknown, we estimate \(\sigma^2\) by \(s^2\) and hence estimate \(\text{var}[\hat{\beta}_j]\).
Then
\[ T = \frac{\hat\beta_j - \beta_j}{\text{se}(\hat\beta_j)} \tag{4}\]
has a Student's \(t\) distribution with \(n - p'\) df, which converges to the standard normal as the df increase (e.g., through an increased sample size).
Hypothesis tests for \(\beta_j\)
We use the statistic
\[ T = \frac{\hat{\beta}_j - \beta_j^0}{\text{se}(\hat{\beta}_j)} \tag{5}\]
to test the null hypothesis \(H_0 : \beta_j = \beta_j^0\) against an alternative (one-sided or two-sided).
When \(H_0\) is true and \(\sigma^2\) is unknown, \(T\) has a \(t\)-distribution with \(n-p'\) df, so we assess significance by referring to this distribution. Each individual \(t\)-test assesses whether there is evidence that the parameter differs from \(\beta_j^0\) in the presence of the other variables currently in the model.
Example 1 We can use coef() on a summary() of an lm() object to get the following table.
data(lungcap)
lungcap$Smoke <- factor(lungcap$Smoke, levels=c(0, 1),
   labels=c("Non-smoker","Smoker"))
LC.m1 <- lm(log(FEV) ~ Age + Ht + Gender + Smoke, data=lungcap)
round(coef(summary(LC.m1)), 5)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.94400 0.07864 -24.72067 0.00000
Age 0.02339 0.00335 6.98449 0.00000
Ht 0.04280 0.00168 25.48933 0.00000
GenderM 0.02932 0.01172 2.50196 0.01260
SmokeSmoker -0.04607 0.02091 -2.20311 0.02794
Here:
- The Estimate column contains the parameter estimates \(\hat{\beta}_j\)
- The Std. Error column contains the corresponding standard errors \(\text{se}(\hat{\beta}_j)\)
- The t value column contains the corresponding \(t\)-statistic (Equation 5) for testing \(H_0: \beta_j = 0\)
- The Pr(>|t|) column contains the corresponding two-tailed \(p\)-values for the hypothesis tests
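For instance, the t value and p-value in the Age row can be reproduced directly from the estimate and standard error (a quick check, assuming the LC.m1 fit above):
est  <- coef(summary(LC.m1))["Age", "Estimate"]
se   <- coef(summary(LC.m1))["Age", "Std. Error"]
tval <- est / se                              # t statistic for H0: beta_Age = 0
2 * pt(-abs(tval), df=df.residual(LC.m1))     # two-tailed p-value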
Confidence intervals for \(\beta_j\)
Confidence intervals give us a sense of the size of an observed effect. You can read about the formula on p. 55 of the text, or use confint() on an lm() object. By default, R produces 95% confidence intervals; set, e.g., level=0.90 to calculate 90% confidence intervals instead.
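For example, for the lung capacity model LC.m1 from Example 1 (a minimal sketch; output not shown):
confint(LC.m1)               # 95% confidence intervals for each beta_j
confint(LC.m1, level=0.90)   # 90% confidence intervals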
Confidence intervals for \(\mu\)
Computing confidence intervals for \(\mu\) is a bit different, as the fitted values \(\hat{\mu}\) depend in turn on the estimates \(\hat{\beta}_j\). In this case, the \(100(1-\alpha)\)% confidence interval for the fitted value is
\[ \hat{\mu}_g \pm t^*_{\alpha/2,n-p'}\,\text{se}(\hat\mu_g), \tag{6}\]
where \(\text{se}(\hat\mu_g) = \sqrt{\text{var}[\hat\mu_g]}\), and where \(t^*_{\alpha/2,n-p'}\) is the value such that an area of \(\alpha/2\) lies in each tail of the \(t\)-distribution with \(n-p'\) df.
R gives us the standard errors when we make predictions using predict() with the argument se.fit=TRUE. We can then form the confidence intervals directly.
Example 2 (Example 2.18, p. 56) For the lung capacity data (data set: lungcap), suppose we wish to estimate \(\mu = E[\log(\text{FEV})]\) for female smokers aged 18 who are 66 inches tall. Using R, we first create a new data frame containing the values of the explanatory variables for which we need to make the prediction:
new.df <- data.frame(Age=18, Ht=66, Gender="F", Smoke="Smoker")
Then, use predict() to compute the estimates of \(\mu\):
out <- predict( LC.m1, newdata=new.df, se.fit=TRUE)
names(out)
[1] "fit" "se.fit" "df" "residual.scale"
out$se.fit
[1] 0.02350644
tstar <- qt(df=LC.m1$df, p=0.975 ) # For a 95% CI
ci.lo <- out$fit - tstar*out$se.fit
ci.hi <- out$fit + tstar*out$se.fit
CIinfo <- cbind( Lower=ci.lo, Estimate=out$fit, Upper=ci.hi)
CIinfo
Lower Estimate Upper
1 1.209268 1.255426 1.301584
The prediction is \(\hat\mu = 1.255\) and the 95% confidence interval is from 1.209 to 1.302. An approximate confidence interval for \(E[\text{FEV}]\) is thus
exp(CIinfo)
Lower Estimate Upper
1 3.351032 3.509334 3.675114
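As an aside, the same interval on the log scale can be obtained in one step (a sketch using the same model and new.df as above; output not shown):
predict(LC.m1, newdata=new.df, interval="confidence", level=0.95)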
Analysis of Variance for Regression Models
Our linear regression models yield fitted values \(\hat{\mu}_i\) for each observation \(y_i\). We can therefore separate each observation into a component predicted by the model (\(\hat{\mu}_i\)) and the residual that is left over (\(y_i - \hat{\mu}_i\)):
\[ y_i = \hat{\mu}_i + (y_i - \hat{\mu}_i). \]
That is, \(\text{Data} = \text{Fit} + \text{Residual}\). To evaluate the contributions of each of the covariates \(x_{ij}\), we explore the mean-corrected data,
\[ y_i - \overline{y}_w = (\hat{\mu}_i - \overline{y}_w) + (y_i - \hat{\mu}_i), \tag{7}\]
where \(\overline{y}_w = \sum_i w_i y_i/ \sum_i w_i\) is the weighted mean of the observations. Squaring each term of Equation 7 and summing over \(i\) yields the identity
\[ \text{SST} = \text{SSREG} + \text{RSS}, \tag{8}\]
where:
- \(\text{SST} = \sum_{i=1}^n w_i (y_i - \overline{y}_w)^2\) is the total sum of squares
- \(\text{SSREG} = \sum_{i=1}^n w_i (\hat{\mu}_i - \overline{y}_w)^2\) is the regression sum of squares and
- \(\text{RSS} = \sum_{i=1}^n w_i (y_i - \hat{\mu}_i)^2\) is the residual sum of squares.
The identity shown in Equation 8 encapsulates the notion that variation comes from two sources: a systematic component (SSREG) and a random component (RSS). This is the basis for analysis of variance, because it describes the ways in which variance in the data arises.
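As a quick numerical check of Equation 8 (a sketch, assuming the weighted gestation fit gest.wtd from earlier in these notes):
w <- gestation$Births; y <- gestation$Weight
mu.hat <- fitted(gest.wtd)
ybar.w <- sum(w*y)/sum(w)                  # weighted mean of the responses
SST   <- sum(w*(y - ybar.w)^2)             # total sum of squares
SSREG <- sum(w*(mu.hat - ybar.w)^2)        # regression sum of squares
RSS   <- sum(w*(y - mu.hat)^2)             # residual sum of squares
c(SST=SST, SSREG.plus.RSS=SSREG + RSS)     # the two agree, verifying Equation 8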
In order to determine whether the explanatory variables are useful predictors of the response, we can calculate the \(F\)-statistic, which tests whether SSREG is larger than would be expected due to random variation, i.e., whether SSREG is large relative to RSS after taking the number of explanatory variables into account. The \(F\)-statistic is given by
\[ F = \frac{\text{SSREG}/(p'- 1)}{\text{RSS}/(n-p')}. \]
A large value of \(F\) means that the proportion of the variation explained by the systematic component is large relative to \(s^2\), the unbiased estimator of \(\sigma^2\); a small value of \(F\) means that this proportion is small relative to \(s^2\).
The summary() function does not show the ANOVA table, though it does show the \(F\)-statistic. You can access the ANOVA table by applying anova() to your model.
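For example, for the weighted gestation model (a minimal sketch; output not shown):
anova(gest.wtd)   # the Age row gives SSREG, the Residuals row gives RSS; the F value matches summary(gest.wtd)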
Coefficient of Determination
The proportion of the total variation explained by the regression is the coefficient of determination,
\[ R^2 = \frac{\text{SSREG}}{\text{SST}} = 1 - \frac{\text{RSS}}{\text{SST}}. \]
Note that by definition \(0\le R^2 \le 1\).
The main downside of \(R^2\) is that it tends to increase when adding any explanatory variable to the model, even if the new variable has little or no explanatory power. Thus, we may adjust \(R^2\) as follows:
\[ \overline{R}^2 = 1 - \frac{\text{RSS}/(n-p')}{\text{SST}/(n-1)} = 1 - (1-R^2) \frac{n-1}{n-p'}. \tag{9}\]
Note that this is the value that R reports as the Adjusted R-squared in the model summary().
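Both quantities can be extracted from the summary object directly (a sketch, using the LC.m1 fit from Example 1):
summary(LC.m1)$r.squared       # R^2
summary(LC.m1)$adj.r.squared   # adjusted R^2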
Question: Why does the quantity described in Equation 9 fix the problem of \(R^2\) increasing when new explanatory variables are added?
Exploring the Term dataset
In what follows, we'll explore the Term dataset. About this dataset (source):
Like all firms, life insurance companies continually seek new ways to deliver products to the market. Those involved in product development wish to know “who buys insurance and how much do they buy?” Analysts can readily get information on characteristics of current customers through company databases. Potential customers, those that do not have insurance with the company, are often the main focus for expanding market share. We examine the Survey of Consumer Finances (SCF), a nationally representative sample that contains extensive information on assets, liabilities, income, and demographic characteristics of those sampled (potential U.S. customers). We study a random sample of 500 households with positive incomes that were interviewed in the 2004 survey. For term life insurance, the quantity of insurance is measured by the policy FACE, the amount that the company will pay in the event of the death of the named insured.
Term <- read.csv("https://prof.mkjanssen.org/glm/data/TermLife.csv", header=TRUE) # import the data
# PICK THE SUBSET OF THE DATA CORRESPONDING TO TERM PURCHASE
Term2 <- subset(Term, subset=FACE > 0)
The variables are:
- GENDER: The gender of the survey respondent
- AGE: Age of the survey respondent
- MARSTAT: Marital status of the survey respondent (=1 if married, =2 if living with partner, and =0 otherwise)
- EDUCATION: Number of years of education of the survey respondent
- ETHNICITY: Ethnicity
- SMARSTAT: Marital status of the respondent's spouse
- SGENDER: Gender of the respondent's spouse
- SAGE: Age of the respondent's spouse
- SEDUCATION: Education of the respondent's spouse
- NUMHH: Number of household members
- INCOME: Annual income of the family
- CHARITY: Charitable contributions
- FACE: Amount that the company will pay in the event of the death of the named insured
- FACECVLIFEPOLICIES: Face amount of life insurance policy with a cash value
- CASHCVLIFEPOLICIES: Cash value of life insurance policy with a cash value
- BORROWCVLIFEPOL: Amount borrowed on life insurance policy with a cash value
- NETVALUE: Net amount at risk on life insurance policy with a cash value
Your job: work with up to one other person to develop a multiple linear regression model to predict the amount of coverage a customer may buy. Make sure you do proper exploratory data analysis (you may find you want to do a log transform on one or more variables!). Identify significance levels for your parameter estimates, as well as confidence intervals for the parameters and for the fitted values for representative customers. You should also report the proportion of the variation explained by your model relative to the variation attributed to the random component.
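One possible starting point is sketched below; the variable choices, the log transforms, and the model itself are assumptions you should revisit during your own exploratory analysis.
summary(Term2[, c("FACE", "INCOME", "EDUCATION", "NUMHH")])   # quick numerical EDA
hist(Term2$FACE)                       # heavily right-skewed, suggesting a log transform
term.m1 <- lm(log(FACE) ~ log(INCOME) + EDUCATION + NUMHH, data=Term2)
coef(summary(term.m1))                 # estimates, SEs, t values, p-values
confint(term.m1)                       # confidence intervals for the parameters
summary(term.m1)$r.squared             # proportion of variation explained by the model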