Intro to Linear Regression

Stat 203 Lecture 4

Dr. Janssen

Linear Regression Models Defined

Setup

  • Response variable \(y\)
  • Explanatory variables \(x_1, x_2, \ldots, x_p\)
  • Random component: the responses \(y_i\) are assumed to have constant variance \(\sigma^2\), or more generally variances that are inversely proportional to known, positive prior weights \(w_i\): \(\text{var}[y_i] = \sigma^2/w_i\).
  • This allows the possibility of giving more weight to some observations than others.
  • Systematic component: \(\mu_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ji}\).

General Form

\[ \begin{cases} \text{var}[y_i] = \sigma^2/w_i \\ \mu_i = \beta_0 + \sum\limits_{j=1}^p \beta_j x_{ji} \end{cases} \qquad(1)\]

where \(E[y_i] = \mu_i\), and the prior weights \(w_i\) are known.

The regression parameters \(\beta_0, \beta_1, \ldots, \beta_p\), as well as the error variance \(\sigma^2\), are unknown and must be estimated from the data.

  • Number of regression parameters is \(p' = p + 1\)
  • \(\beta_0\) is often called the intercept
  • \(\beta_1, \beta_2, \ldots, \beta_p\) are sometimes called the slopes
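As a minimal sketch of how the parameters in Equation 1 might be estimated in R, the example below simulates data and fits the model with `lm()` and its `weights` argument. The simulated values, the choice \(p = 2\), the true coefficients, and the use of normally distributed responses are all assumptions made for illustration; the model itself only specifies the mean and variance.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(203)
n  <- 50
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
w  <- sample(1:10, n, replace = TRUE)          # known, positive prior weights

mu <- 2 + 0.5 * x1 - 1.5 * x2                  # systematic component with p = 2
y  <- rnorm(n, mean = mu, sd = 1 / sqrt(w))    # var[y_i] = sigma^2 / w_i, sigma^2 = 1

# Weighted linear regression: lm() takes the prior weights via `weights`
fit <- lm(y ~ x1 + x2, weights = w)

coef(fit)               # estimates of beta_0, beta_1, beta_2
length(coef(fit))       # number of regression parameters, p' = p + 1
summary(fit)$sigma^2    # estimate of sigma^2
```

Here the intercept estimate corresponds to \(\beta_0\) and the remaining coefficients to the slopes \(\beta_1\) and \(\beta_2\).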

More terminology

  • When \(p = 1\), i.e., \(\mu = \beta_0 + \beta_1 x_1\), we have a simple linear regression model
  • When \(p > 1\), we have a multiple linear regression model
  • A model with all prior weights \(w_i = 1\) is called ordinary (as opposed to weighted)
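The distinction between ordinary and weighted models can be illustrated with a short sketch on simulated data (the data and variable names below are hypothetical). Calling `lm()` without `weights` fits an ordinary regression, which should agree with a fit where every prior weight is set to 1.

```r
# Simulated data for a simple linear regression (p = 1); values are made up
set.seed(1)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)

# Ordinary regression: no weights supplied
fit_default <- lm(y ~ x)

# The same model with all prior weights explicitly set to 1
fit_unit <- lm(y ~ x, weights = rep(1, n))

# Compare the two sets of coefficient estimates
all.equal(coef(fit_default), coef(fit_unit))

# A weighted regression: unequal (hypothetical) prior weights
w <- sample(1:5, n, replace = TRUE)
fit_weighted <- lm(y ~ x, weights = w)
coef(fit_weighted)
```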

Assumptions

  • Suitability: the same regression model is appropriate for all the observations
  • Linearity: the true relationship between \(\mu\) and each quantitative explanatory variable is linear
  • Constant variance: the unknown part of the variance of the responses, \(\sigma^2\), is constant
  • Independence: the responses \(y\) are independent of one another

Exploration

Is simple linear regression a reasonable tool for modeling the following situations?

  • A researcher suspects that loud music can affect how quickly drivers react. She randomly selects drivers to drive the same stretch of road with varying levels of music volume. Stopping distances for each driver are measured along with the decibel level of the music on their car radio. Response: reaction time. Explanatory: decibel level of music.
  • Medical researchers investigated the outcome of a particular surgery for patients with comparable stages of disease but different ages. The ten hospitals in the study had at least two surgeons performing the surgery of interest. Patients were randomly selected for each surgeon at each hospital. The surgery outcome was recorded on a scale of 1-10. Response: surgery outcome, scale 1-10. Explanatory: patient age, in years.

Example

# Load the gestation data from the GLMsData package and inspect its structure
library(GLMsData)
data(gestation)
str(gestation)
'data.frame':   21 obs. of  4 variables:
 $ Age   : int  22 23 25 27 28 29 30 31 32 33 ...
 $ Births: int  1 1 1 1 6 1 3 6 7 7 ...
 $ Weight: num  0.52 0.7 1 1.17 1.2 ...
 $ SD    : num  NA NA NA NA 0.121 NA 0.589 0.319 0.438 0.313 ...

Example

library(GLMsData); data(gestation)
# Open circles: gestational ages with fewer than 20 births; filled circles: 20 or more
plot( Weight ~ Age, data=gestation, las=1, pch=ifelse( Births<20, 1, 19),
    xlab="Gestational age (weeks)", ylab="Mean birthweight (kg)",
    xlim=c(20, 45), ylim=c(0, 4))

A possible model for the data is

\[ \begin{cases} \text{var}[y_i] = \sigma^2 / m_i\\ \mu_i = \beta_0 + \beta_1 x_i \end{cases} \qquad(2)\]

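The model given in Equation 2 is a weighted linear regression model: \(y_i\) is the mean birthweight at gestational age \(x_i\), and the prior weight \(w_i = m_i\) is the number of births contributing to that mean (the Births column).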
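One way Equation 2 might be fitted is sketched below, assuming that \(m_i\) is the number of births recorded in the `Births` column and that `Weight` and `Age` play the roles of \(y_i\) and \(x_i\). The object names `fit_w` and `fit_o` are chosen here for illustration.

```r
library(GLMsData)
data(gestation)

# Weighted simple linear regression: prior weights are the numbers of births
fit_w <- lm(Weight ~ Age, data = gestation, weights = Births)

# Ordinary fit (all prior weights equal to 1), for comparison
fit_o <- lm(Weight ~ Age, data = gestation)

coef(fit_w)   # estimates of beta_0 and beta_1 under the weighted model
coef(fit_o)   # estimates when every observation is weighted equally
```

Comparing the two sets of estimates gives a sense of how much the means based on many births influence the fitted line.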

Exploration: Bank Salary Data

Response variable of interest: sal77 (annual salary in 1977) and/or bsal (annual salary at the time of hire)

  • sex: MALE or FEMALE
  • senior : months since hired
  • age: age in months
  • educ: years of education
  • exper: months of prior work experience

salary <- read.csv("https://prof.mkjanssen.org/glm/data/banksalary.csv")
summary(salary)
      bsal          sal77           sex                senior     
 Min.   :3900   Min.   : 7860   Length:93          Min.   :65.00  
 1st Qu.:4980   1st Qu.: 9000   Class :character   1st Qu.:74.00  
 Median :5400   Median :10020   Mode  :character   Median :84.00  
 Mean   :5420   Mean   :10393                      Mean   :82.28  
 3rd Qu.:6000   3rd Qu.:11220                      3rd Qu.:90.00  
 Max.   :8100   Max.   :16320                      Max.   :98.00  
      age             educ           exper      
 Min.   :280.0   Min.   : 8.00   Min.   :  0.0  
 1st Qu.:349.0   1st Qu.:12.00   1st Qu.: 35.5  
 Median :468.0   Median :12.00   Median : 70.0  
 Mean   :474.4   Mean   :12.51   Mean   :100.9  
 3rd Qu.:590.0   3rd Qu.:15.00   3rd Qu.:144.0  
 Max.   :774.0   Max.   :16.00   Max.   :381.0  
head(salary)
  bsal sal77  sex senior age educ exper
1 5040 12420 MALE     96 329   15  14.0
2 6300 12060 MALE     82 357   15  72.0
3 6000 15120 MALE     67 315   15  35.5
4 6000 16320 MALE     97 354   12  24.0
5 6000 12300 MALE     66 351   12  56.0
6 6840 10380 MALE     92 374   15  41.5

Exploration: bsal vs exper

plot( bsal ~ exper, data=salary, las=1,
    xlab="Experience (months)", ylab="Starting salary (dollars)",
    ylim=c(0,8500)
    )
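To help judge whether a straight line is a reasonable description of this relationship, one option is to overlay a fitted simple linear regression and a smooth curve on the scatterplot. The sketch below assumes the data file is still available at the URL used above; the object name `fit_exper` and the line styles are illustrative choices, not part of the original analysis.

```r
salary <- read.csv("https://prof.mkjanssen.org/glm/data/banksalary.csv")

# Ordinary simple linear regression of starting salary on prior experience
fit_exper <- lm(bsal ~ exper, data = salary)

plot( bsal ~ exper, data=salary, las=1,
    xlab="Experience (months)", ylab="Starting salary (dollars)",
    ylim=c(0,8500) )

# Solid line: fitted regression line
abline(fit_exper, lty = 1)

# Dashed line: lowess smooth, which does not assume linearity
lines(lowess(salary$exper, salary$bsal), lty = 2)
```

If the dashed smooth departs noticeably from the solid line, that is a hint that the linearity assumption deserves a closer look.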