Simple Linear Regression

Stat 203 Lecture 5

Dr. Janssen

Least-Squares Estimation

Problem: estimate \(\beta_0\) and \(\beta_1\)

Deviations between observed data \(y_i\) and model \(\mu_i\) are

\[ e_i = y_i - \mu_i = y_i - (\beta_0 + \beta_1 x_i) = y_i - \beta_0 - \beta_1 x_i \]

Goal: Minimize \(S(\beta_0, \beta_1) = \sum\limits_{i=1}^n w_i e_i^2\).

Example: `gestation`

library(GLMsData); data(gestation)
plot(Weight ~ Age, data=gestation,
     xlab="Age (years)",
     ylab="Weight (grams)",
     ylim=c(0,4)
     )

Example: `gestation`

library(GLMsData); data(gestation)
plot(Weight ~ Age, data=gestation,
     xlab="Age (years)",
     ylab="Weight (grams)",
     ylim=c(0,4)
     )

y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.A <- 0.0; beta1.A <- 0.13 # Try values
mu.A <- beta0.A + beta1.A*x
SA <- sum(wts*(y - mu.A)^2); SA

[1] 4652.819

Example: `gestation`

library(GLMsData); data(gestation)
plot(Weight ~ Age, data=gestation,
     xlab="Age (years)",
     ylab="Weight (grams)",
     ylim=c(0,4),
     xlim=c(0,50)
     )

Example: `gestation`

y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.A <- -0.9; beta1.A <- 0.1 # Updated values
mu.A <- beta0.A + beta1.A*x
SA <- sum(wts*(y - mu.A)^2); SA

Example: `gestation`

y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.B <- -0.9; beta1.B <- 0.1 # Updated values
mu.B <- beta0.B + beta1.B*x
SB <- sum(wts*(y - mu.B)^2); SB

[1] 186.1106

Example: `gestation`

y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.C <- -2.678; beta1.C <- 0.1538 # Final values
mu.C <- beta0.C + beta1.C*x
SC <- sum(wts*(y - mu.C)^2); SC

[1] 11.42575

But how do we find these values?

Coefficient Estimates

Calculus!

Take (partial) derivatives!

\[ \begin{align} \frac{\partial S(\beta_0,\beta_1)}{\partial \beta_0} &= 2 \sum\limits_{i=1}^n w_i (y_i - \mu_i);\\ \frac{\partial S(\beta_0,\beta_1)}{\partial \beta_1} &= 2 \sum\limits_{i=1}^n w_i x_i (y_i - \mu_i). \end{align} \]

Solutions

We (you will) find:

\[ \begin{align} \hat\beta_0 &= \overline{y}_w - \hat\beta_1 \overline{x}_w;\\ \hat\beta_1 &= \frac{\text{SS}_{xy}}{\text{SS}_x} = \frac{\sum_{i=1}^n w_i (x_i - \overline{x}_w)y_i}{\sum_{i=1}^n w_i (x_i - \overline{x}_w)^2}, \end{align} \] where \(\overline{x}_w\) and \(\overline{y}_w\) are the weighted means

\[ \overline{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} \qquad \text{ and }\qquad \overline{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} \]

Terminology

We call \(\hat\beta_0\) and \(\hat\beta_1\) the least-squares estimators of \(\beta_0\) and \(\beta_1\), respectively.
The fitted values are estimated by \(\hat\mu_i = \hat\beta_0 + \hat\beta_1 x_i\) for \(i = 1,2,\ldots,n\).
The minimized value of \(S(\beta_0, \beta_1)\), evaluated at the least-squares estimates \(\beta_0 = \hat\beta_0\) and \(\beta_1 = \hat\beta_1\), is called the residual sum-of-squares (RSS):

\[ \text{RSS} = \sum\limits_{i=1}^n w_i (y_i - \hat\mu_i)^2 = \sum\limits_{i=1}^n w_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \]

Using the formulae

xbar <- weighted.mean(x, w=wts)     # The weighted mean of x (Age)
SSx <- sum( wts*(x-xbar)^2 )
ybar <- weighted.mean(y, w=wts)     # The weighted mean of y (Weight)
SSxy <- sum( wts*(x-xbar)*y )
beta1 <- SSxy / SSx; beta0 <- ybar - beta1*xbar
mu <- beta0 + beta1*x
RSS <- sum( wts*(y - mu )^2 )
c( beta0=beta0, beta1=beta1, RSS=RSS )

     beta0      beta1        RSS 
-2.6783891  0.1537594 11.4198322

Using the formulae

xbar <- weighted.mean(x, w=wts)     # The weighted mean of x (Age)
SSx <- sum( wts*(x-xbar)^2 )
ybar <- weighted.mean(y, w=wts)     # The weighted mean of y (Weight)
SSxy <- sum( wts*(x-xbar)*y )
beta1 <- SSxy / SSx; beta0 <- ybar - beta1*xbar
mu <- beta0 + beta1*x
RSS <- sum( wts*(y - mu )^2 )
c( beta0=beta0, beta1=beta1, RSS=RSS )

     beta0      beta1        RSS 
-2.6783891  0.1537594 11.4198322

lm(Weight ~ Age, weights=Births, data=gestation)


Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)

Coefficients:
(Intercept)          Age  
    -2.6784       0.1538

Estimating the Variance \(\sigma^2\)

Recall the definition:

\[ \sigma^2/w_i = \text{var}[y_i] = E[(y_i - \mu_i)^2]. \]

\[ s^2 = \frac{\text{RSS}}{n-2}. \]

df <- length(y) - 2
s2 <- RSS / df
c(df = df, s=sqrt(s2), s2=s2)

        df          s         s2 
19.0000000  0.7752701  0.6010438

It’s reasonable (though incorrect) to try estimating \(\sigma^2\) by the average of the squared deviations \(\sum w_i (y_i - \hat\mu_i)^2 = \text{RSS}\). We thus land at the naive estimate \(\hat\sigma^2 = \frac{\text{RSS}}{n}\).

If the \(\mu_i\) were known and not estimated (by \(\hat\mu_i\)), this would be an ideal estimator. But since the \(\hat\mu_i\) are already estimated, \(\hat\sigma^2\) is biased (negatively).

The correct approach is to divide by \(n-2\) instead of \(n\). We call \(n-2\) the residual degrees of freedom, which is the number of observations minus the number of coefficients estimated in the systematic component of the linear regression model.

We think of each estimation process (for each coefficient) as “using up” one of the observations in the following sense: if the first \(n-2\) values of \(r_i = y_i - \hat\beta_0 - \hat\beta_1 x_i\) were known, then the remaining two values could be inferred from \(\hat\beta_0\) and \(\hat\beta_1\). So, given the coefficient estimates, there are only \(n-2\) degrees of freedom available to the residuals.

Using `R`

summary(lm(Weight ~ Age, weights=Births, data=gestation))


Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-1.62979 -0.60893 -0.30063 -0.08845  1.03880 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.678389   0.371172  -7.216 7.49e-07 ***
Age          0.153759   0.009493  16.197 1.42e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7753 on 19 degrees of freedom
Multiple R-squared:  0.9325,    Adjusted R-squared:  0.9289 
F-statistic: 262.3 on 1 and 19 DF,  p-value: 1.416e-12

Variance and Standard Errors of the Coefficients

The variance of the parameter estimates are

\[ \text{var}[\hat\beta_0] = \sigma^2 \left( \frac{1}{\sum w_i} + \frac{\overline{x}^2_w}{\text{SS}_x}\right) \qquad \text{and} \qquad \text{var}[\hat\beta_1] = \frac{\sigma^2}{\text{SS}_x} \]

where \(\overline{x}_w\) is the weighted mean.

We may estimate \(\text{var}[\hat\beta_j]\), denoted \(\widehat\var[\hat\beta_j]\), by substituting \(s^2\) for the unknown true variance \(\sigma^2\).

Standard Errors for coefficients

The standard error of an estimated quantity is its standard deviation. They are the square roots of \(\widehat\var[\hat\beta_j]\).

In R:

var.b0 <- s2 * ( 1/sum(wts) + xbar^2 / SSx )
var.b1 <- s2 / SSx
sqrt( c( beta0=var.b0, beta1=var.b1) )  # The std errors

      beta0       beta1 
0.371172341 0.009493212

Standard errors of fitted values

\[ \var[\hat\mu_g] = \sigma^2 \left(\frac{1}{\sum w_i} + \frac{(x_g - \overline{x})^2}{\text{SS}_x}\right) \]

\[ \text{se}(\hat\mu_g) = \sqrt{\widehat\var[\hat\mu_g]} = \sqrt{s^2 \left(\frac{1}{\sum w_i} + \frac{(x_g - \overline{x})^2}{\text{SS}_x}\right)} \]

Example in `R`

x.g <- 30
mu.g <- beta0 + x.g * beta1
var.mu.g <- s2 * ( 1/sum(wts) + (x.g-xbar)^2 / SSx )
se.mu.g <- sqrt(var.mu.g)
c( mu=mu.g, se=sqrt(var.mu.g))

      mu       se 
1.934392 0.088124

Exploration

`gpsleep`

In GLMsData, explore the gpsleep dataset:

Amount of sleep in guinea pigs after receiving ketamine

In two ways, calculate:

RSS
coefficient estimates
standard errors

Consider: is linear regression appropriate for this situation? Why or why not?

Simple Linear Regression

Least-Squares Estimation

Problem: estimate \(\beta_0\) and \(\beta_1\)

Example: gestation

Example: gestation

Example: gestation

Example: gestation

Example: gestation

Example: gestation

Coefficient Estimates

Calculus!

Solutions

Terminology

Using the formulae

Using the formulae

Estimating the Variance \(\sigma^2\)

Using R

Variance and Standard Errors of the Coefficients

Standard Errors for coefficients

Standard errors of fitted values

Example in R

Exploration

gpsleep

Example: `gestation`

Example: `gestation`

Example: `gestation`

Example: `gestation`

Example: `gestation`

Example: `gestation`

Using `R`

Example in `R`

`gpsleep`