Stat 203 Lecture 5
Deviations between observed data \(y_i\) and model \(\mu_i\) are
\[ e_i = y_i - \mu_i = y_i - (\beta_0 + \beta_1 x_i) = y_i - \beta_0 - \beta_1 x_i \]
Goal: Minimize \(S(\beta_0, \beta_1) = \sum\limits_{i=1}^n w_i e_i^2\).
gestation
gestation
library(GLMsData); data(gestation)
plot(Weight ~ Age, data=gestation,
xlab="Age (years)",
ylab="Weight (grams)",
ylim=c(0,4)
)
y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.A <- 0.0; beta1.A <- 0.13 # Try values
mu.A <- beta0.A + beta1.A*x
SA <- sum(wts*(y - mu.A)^2); SA
[1] 4652.819
gestation
gestation
gestation
gestation
y <- gestation$Weight
x <- gestation$Age
wts <- gestation$Births
beta0.C <- -2.678; beta1.C <- 0.1538 # Final values
mu.C <- beta0.C + beta1.C*x
SC <- sum(wts*(y - mu.C)^2); SC
[1] 11.42575
But how do we find these values?
Take (partial) derivatives!
\[ \begin{align} \frac{\partial S(\beta_0,\beta_1)}{\partial \beta_0} &= 2 \sum\limits_{i=1}^n w_i (y_i - \mu_i);\\ \frac{\partial S(\beta_0,\beta_1)}{\partial \beta_1} &= 2 \sum\limits_{i=1}^n w_i x_i (y_i - \mu_i). \end{align} \]
We (you will) find:
\[ \begin{align} \hat\beta_0 &= \overline{y}_w - \hat\beta_1 \overline{x}_w;\\ \hat\beta_1 &= \frac{\text{SS}_{xy}}{\text{SS}_x} = \frac{\sum_{i=1}^n w_i (x_i - \overline{x}_w)y_i}{\sum_{i=1}^n w_i (x_i - \overline{x}_w)^2}, \end{align} \] where \(\overline{x}_w\) and \(\overline{y}_w\) are the weighted means
\[ \overline{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} \qquad \text{ and }\qquad \overline{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} \]
\[ \text{RSS} = \sum\limits_{i=1}^n w_i (y_i - \hat\mu_i)^2 = \sum\limits_{i=1}^n w_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \]
xbar <- weighted.mean(x, w=wts) # The weighted mean of x (Age)
SSx <- sum( wts*(x-xbar)^2 )
ybar <- weighted.mean(y, w=wts) # The weighted mean of y (Weight)
SSxy <- sum( wts*(x-xbar)*y )
beta1 <- SSxy / SSx; beta0 <- ybar - beta1*xbar
mu <- beta0 + beta1*x
RSS <- sum( wts*(y - mu )^2 )
c( beta0=beta0, beta1=beta1, RSS=RSS )
beta0 beta1 RSS
-2.6783891 0.1537594 11.4198322
xbar <- weighted.mean(x, w=wts) # The weighted mean of x (Age)
SSx <- sum( wts*(x-xbar)^2 )
ybar <- weighted.mean(y, w=wts) # The weighted mean of y (Weight)
SSxy <- sum( wts*(x-xbar)*y )
beta1 <- SSxy / SSx; beta0 <- ybar - beta1*xbar
mu <- beta0 + beta1*x
RSS <- sum( wts*(y - mu )^2 )
c( beta0=beta0, beta1=beta1, RSS=RSS )
beta0 beta1 RSS
-2.6783891 0.1537594 11.4198322
Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)
Coefficients:
(Intercept) Age
-2.6784 0.1538
Recall the definition:
\[ \sigma^2/w_i = \text{var}[y_i] = E[(y_i - \mu_i)^2]. \]
\[ s^2 = \frac{\text{RSS}}{n-2}. \]
R
Call:
lm(formula = Weight ~ Age, data = gestation, weights = Births)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.62979 -0.60893 -0.30063 -0.08845 1.03880
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.678389 0.371172 -7.216 7.49e-07 ***
Age 0.153759 0.009493 16.197 1.42e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7753 on 19 degrees of freedom
Multiple R-squared: 0.9325, Adjusted R-squared: 0.9289
F-statistic: 262.3 on 1 and 19 DF, p-value: 1.416e-12
\[ \def\var{{\text{var}}} \]
The variance of the parameter estimates are
\[ \text{var}[\hat\beta_0] = \sigma^2 \left( \frac{1}{\sum w_i} + \frac{\overline{x}^2_w}{\text{SS}_x}\right) \qquad \text{and} \qquad \text{var}[\hat\beta_1] = \frac{\sigma^2}{\text{SS}_x} \]
where \(\overline{x}_w\) is the weighted mean.
We may estimate \(\text{var}[\hat\beta_j]\), denoted \(\widehat\var[\hat\beta_j]\), by substituting \(s^2\) for the unknown true variance \(\sigma^2\).
The standard error of an estimated quantity is its standard deviation. They are the square roots of \(\widehat\var[\hat\beta_j]\).
\[ \var[\hat\mu_g] = \sigma^2 \left(\frac{1}{\sum w_i} + \frac{(x_g - \overline{x})^2}{\text{SS}_x}\right) \]
\[ \text{se}(\hat\mu_g) = \sqrt{\widehat\var[\hat\mu_g]} = \sqrt{s^2 \left(\frac{1}{\sum w_i} + \frac{(x_g - \overline{x})^2}{\text{SS}_x}\right)} \]
R
gpsleep
In GLMsData
, explore the gpsleep
dataset:
Amount of sleep in guinea pigs after receiving ketamine
In two ways, calculate:
Consider: is linear regression appropriate for this situation? Why or why not?