Simple Linear Regression

Stat 203 Lecture 5

Dr. Janssen

Least-Squares Estimation

Problem: estimate \(\beta_0\) and \(\beta_1\)

Deviations between observed data \(y_i\) and model \(\mu_i\) are

\[ e_i = y_i - \mu_i = y_i - (\beta_0 + \beta_1 x_i) = y_i - \beta_0 - \beta_1 x_i \]

Goal: Minimize \(S(\beta_0, \beta_1) = \sum\limits_{i=1}^n w_i e_i^2\).
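One crude way to minimize \(S\) is brute force: evaluate it over a grid of candidate \((\beta_0, \beta_1)\) pairs and keep the smallest. The slides below try a few candidate values by hand in R; here is a throwaway Python sketch of the grid idea, on made-up data (all numbers here are invented for illustration):

```python
# Brute-force grid search for the weighted least-squares criterion
# S(b0, b1) = sum_i w_i * (y_i - b0 - b1*x_i)^2, on made-up data.
def S(b0, b1, x, y, w):
    return sum(wi * (yi - b0 - b1 * xi) ** 2 for xi, yi, wi in zip(x, y, w))

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]    # roughly y = 2x
w = [1, 1, 2, 1]            # unequal weights

# Try every (b0, b1) pair on a coarse grid and keep the best.
grid = [i / 10 for i in range(-30, 31)]
best = min(((b0, b1) for b0 in grid for b1 in grid),
           key=lambda p: S(p[0], p[1], x, y, w))
print(best)
```

Grid search only gets within the grid spacing of the true minimizer; the calculus below gives the exact answer in closed form.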

Example: gestation

library(GLMsData); data(gestation)
plot(Weight ~ Age, data=gestation,
     xlab="Age (weeks)",
     ylab="Weight (kg)",
     ylim=c(0,4)
     )

Example: gestation

y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.A <- 0.0; beta1.A <- 0.13 # Try values
mu.A <- beta0.A + beta1.A*x
SA <- sum(wts*(y - mu.A)^2); SA
[1] 4652.819

Example: gestation

library(GLMsData); data(gestation)
plot(Weight ~ Age, data=gestation,
     xlab="Age (weeks)",
     ylab="Weight (kg)",
     ylim=c(0,4),
     xlim=c(0,50)
     )

Example: gestation

y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.B <- -0.9; beta1.B <- 0.1 # Updated values
mu.B <- beta0.B + beta1.B*x
SB <- sum(wts*(y - mu.B)^2); SB
[1] 186.1106

Example: gestation

y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.C <- -2.678; beta1.C <- 0.1538 # Final values
mu.C <- beta0.C + beta1.C*x
SC <- sum(wts*(y - mu.C)^2); SC
[1] 11.42575

But how do we find these values?

Coefficient Estimates

Calculus!

Take (partial) derivatives!

\[ \begin{align} \frac{\partial S(\beta_0,\beta_1)}{\partial \beta_0} &= -2 \sum\limits_{i=1}^n w_i (y_i - \mu_i);\\ \frac{\partial S(\beta_0,\beta_1)}{\partial \beta_1} &= -2 \sum\limits_{i=1}^n w_i x_i (y_i - \mu_i). \end{align} \]
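Setting both partial derivatives to zero yields the normal equations, a pair of linear equations in \(\beta_0\) and \(\beta_1\):

\[ \begin{align} \sum\limits_{i=1}^n w_i (y_i - \beta_0 - \beta_1 x_i) &= 0;\\ \sum\limits_{i=1}^n w_i x_i (y_i - \beta_0 - \beta_1 x_i) &= 0. \end{align} \]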

Solutions

We (you will) find:

\[ \begin{align} \hat\beta_0 &= \overline{y}_w - \hat\beta_1 \overline{x}_w;\\ \hat\beta_1 &= \frac{\text{SS}_{xy}}{\text{SS}_x} = \frac{\sum_{i=1}^n w_i (x_i - \overline{x}_w)y_i}{\sum_{i=1}^n w_i (x_i - \overline{x}_w)^2}, \end{align} \] where \(\overline{x}_w\) and \(\overline{y}_w\) are the weighted means

\[ \overline{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} \qquad \text{ and }\qquad \overline{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} \]
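The slides implement these formulas in R below. As a language-neutral cross-check, here is a minimal Python sketch of the same estimators on made-up data:

```python
# Weighted least-squares estimates computed directly from the
# closed-form formulas (data below are made up for illustration).
def wls_fit(x, y, w):
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    ssx  = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    ssxy = sum(wi * (xi - xbar) * yi for wi, xi, yi in zip(w, x, y))
    beta1 = ssxy / ssx
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# Data lying exactly on y = 1 + 2x: the fit recovers the line,
# whatever the weights.
b0, b1 = wls_fit([0, 1, 2, 3], [1, 3, 5, 7], [1, 2, 1, 2])
print(round(b0, 6), round(b1, 6))   # -> 1.0 2.0
```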

Terminology

  • We call \(\hat\beta_0\) and \(\hat\beta_1\) the least-squares estimators of \(\beta_0\) and \(\beta_1\), respectively.
  • The fitted values are \(\hat\mu_i = \hat\beta_0 + \hat\beta_1 x_i\) for \(i = 1,2,\ldots,n\).
  • The minimized value of \(S(\beta_0, \beta_1)\), evaluated at the least-squares estimates \(\beta_0 = \hat\beta_0\) and \(\beta_1 = \hat\beta_1\), is called the residual sum-of-squares (RSS):

\[ \text{RSS} = \sum\limits_{i=1}^n w_i (y_i - \hat\mu_i)^2 = \sum\limits_{i=1}^n w_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \]
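Given any coefficient estimates, the RSS is a one-line computation. A minimal Python sketch on made-up data (the slides do the same in R for the gestation data below):

```python
# Residual sum-of-squares for given coefficient estimates:
# RSS = sum_i w_i * (y_i - b0 - b1*x_i)^2  (data below are made up).
def rss(x, y, w, b0, b1):
    return sum(wi * (yi - (b0 + b1 * xi)) ** 2
               for xi, yi, wi in zip(x, y, w))

x, y, w = [0, 1, 2, 3], [1.0, 3.1, 4.9, 7.0], [1, 1, 1, 1]

print(rss(x, y, w, 1.0, 2.0))   # RSS for the line y = 1 + 2x
print(rss(x, y, w, 0.0, 2.5))   # a worse-fitting line: larger RSS
```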

Using the formulae

xbar <- weighted.mean(x, w=wts)     # The weighted mean of x (Age)
SSx <- sum( wts*(x-xbar)^2 )
ybar <- weighted.mean(y, w=wts)     # The weighted mean of y (Weight)
SSxy <- sum( wts*(x-xbar)*y )
beta1 <- SSxy / SSx; beta0 <- ybar - beta1*xbar
mu <- beta0 + beta1*x
RSS <- sum( wts*(y - mu )^2 )
c( beta0=beta0, beta1=beta1, RSS=RSS )
     beta0      beta1        RSS 
-2.6783891  0.1537594 11.4198322 

Checking with lm()

The built-in lm() function reproduces these estimates:
lm(Weight ~ Age, weights=Births, data=gestation)

Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)

Coefficients:
(Intercept)          Age  
    -2.6784       0.1538  

Estimating the Variance \(\sigma^2\)

Recall the definition:

\[ \sigma^2/w_i = \text{var}[y_i] = E[(y_i - \mu_i)^2]. \]

An unbiased estimator of \(\sigma^2\) divides the RSS by its residual degrees of freedom, \(n - 2\), since two parameters have been estimated:

\[ s^2 = \frac{\text{RSS}}{n-2}. \]

df <- length(y) - 2
s2 <- RSS / df
c(df = df, s=sqrt(s2), s2=s2)
        df          s         s2 
19.0000000  0.7752701  0.6010438 

Using R

summary(lm(Weight ~ Age, weights=Births, data=gestation))

Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-1.62979 -0.60893 -0.30063 -0.08845  1.03880 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.678389   0.371172  -7.216 7.49e-07 ***
Age          0.153759   0.009493  16.197 1.42e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7753 on 19 degrees of freedom
Multiple R-squared:  0.9325,    Adjusted R-squared:  0.9289 
F-statistic: 262.3 on 1 and 19 DF,  p-value: 1.416e-12

Variance and Standard Errors of the Coefficients

The variances of the parameter estimates are

\[ \text{var}[\hat\beta_0] = \sigma^2 \left( \frac{1}{\sum w_i} + \frac{\overline{x}^2_w}{\text{SS}_x}\right) \qquad \text{and} \qquad \text{var}[\hat\beta_1] = \frac{\sigma^2}{\text{SS}_x} \]

where \(\overline{x}_w\) is the weighted mean.
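A quick numeric check of these formulas in Python, on made-up unweighted data. With all \(w_i = 1\) and \(x\) centred so that \(\overline{x}_w = 0\), the first formula reduces to \(\text{var}[\hat\beta_0] = \sigma^2/n\):

```python
# Estimated variances of the coefficient estimates via the formulas
# above, with s^2 substituted for sigma^2 (made-up, unweighted data).
x = [-1.5, -0.5, 0.5, 1.5]        # centred: weighted mean of x is 0
y = [0.9, 2.2, 2.8, 4.1]
w = [1, 1, 1, 1]

sw   = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
ssx  = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
b1   = sum(wi * (xi - xbar) * yi for wi, xi, yi in zip(w, x, y)) / ssx
b0   = ybar - b1 * xbar
rss  = sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))
s2   = rss / (len(y) - 2)         # residual df = n - 2

var_b0 = s2 * (1 / sw + xbar ** 2 / ssx)
var_b1 = s2 / ssx
print(var_b0 ** 0.5, var_b1 ** 0.5)   # the two standard errors
```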

We may estimate \(\text{var}[\hat\beta_j]\), denoted \(\widehat{\text{var}}[\hat\beta_j]\), by substituting \(s^2\) for the unknown true variance \(\sigma^2\).

Standard Errors for coefficients

The standard error of an estimated quantity is its standard deviation. The standard errors of the coefficients are therefore the square roots of \(\widehat{\text{var}}[\hat\beta_j]\).

In R:

var.b0 <- s2 * ( 1/sum(wts) + xbar^2 / SSx )
var.b1 <- s2 / SSx
sqrt( c( beta0=var.b0, beta1=var.b1) )  # The std errors
      beta0       beta1 
0.371172341 0.009493212 

Standard errors of fitted values

For a new predictor value \(x_g\), the fitted value is \(\hat\mu_g = \hat\beta_0 + \hat\beta_1 x_g\), with variance

\[ \text{var}[\hat\mu_g] = \sigma^2 \left(\frac{1}{\sum w_i} + \frac{(x_g - \overline{x}_w)^2}{\text{SS}_x}\right) \]

\[ \text{se}(\hat\mu_g) = \sqrt{\widehat{\text{var}}[\hat\mu_g]} = \sqrt{s^2 \left(\frac{1}{\sum w_i} + \frac{(x_g - \overline{x}_w)^2}{\text{SS}_x}\right)} \]
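The variance formula follows by substituting \(\hat\beta_0 = \overline{y}_w - \hat\beta_1 \overline{x}_w\), so that \(\hat\mu_g = \overline{y}_w + \hat\beta_1 (x_g - \overline{x}_w)\); the two terms are uncorrelated, so their variances add:

\[ \text{var}[\hat\mu_g] = \text{var}[\overline{y}_w] + (x_g - \overline{x}_w)^2\, \text{var}[\hat\beta_1] = \frac{\sigma^2}{\sum w_i} + \frac{(x_g - \overline{x}_w)^2 \, \sigma^2}{\text{SS}_x}. \]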

Example in R

x.g <- 30
mu.g <- beta0 + x.g * beta1
var.mu.g <- s2 * ( 1/sum(wts) + (x.g-xbar)^2 / SSx )
se.mu.g <- sqrt(var.mu.g)
c( mu=mu.g, se=sqrt(var.mu.g))
      mu       se 
1.934392 0.088124 

Exploration

gpsleep

In GLMsData, explore the gpsleep dataset:

Amount of sleep in guinea pigs after receiving ketamine

In two ways, calculate:

  • RSS
  • coefficient estimates
  • standard errors

Consider: is linear regression appropriate for this situation? Why or why not?