Regression Models

Stat 203 Lecture 3

Dr. Janssen

What are they?

General Definition and Notation

Definition 1 A regression model is a statistical model that assumes the mean response \(\mu_i\) for observation \(i\) depends on the \(p\) explanatory variables \(x_{1i}, x_{2i}, \ldots, x_{pi}\) via some general function \(f\) through a number of regression parameters \(\beta_j\):

\[ E[y_i] = \mu_i = f(x_{1i}, \ldots,x_{pi}; \beta_0, \beta_1, \ldots, \beta_q). \]

More to the point (for this class):

\[ \mu_i = f(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}) \qquad(1)\]

Regression models with the form given in Equation 1 are said to be linear in the parameters. All the models we discuss this semester have this form.

The component \(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}\) is called the linear predictor.

Two Special Types

Definition 2 (Linear Regression) The systematic component of a linear regression model assumes the form

\[ E[y_i] = \mu_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}, \] while the randomness is assumed to have constant variance \(\sigma^2\) about \(\mu_i\).

Definition 3 (Generalized Linear Model) The systematic component of a generalized linear model assumes the form

\[ \begin{align*} \mu_i &= g^{-1}(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi})\\ \text{or: } g(\mu_i) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}, \end{align*} \] where \(g\), called the link function, is a monotonic, differentiable function.

The randomness is given via a specific family of probability distributions, chosen based on the situation.
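As a rough illustration of the link function, base R's family objects store both \(g\) and its inverse. The sketch below uses the log link of the Poisson family; the values of the linear predictor are made up for illustration.

```r
# Family objects in base R store the link g and its inverse.
fam <- poisson(link = "log")

eta <- c(-1, 0, 2)            # made-up values of the linear predictor
mu <- fam$linkinv(eta)        # mu = g^{-1}(eta), here exp(eta)
eta_back <- fam$linkfun(mu)   # g(mu), here log(mu), recovers eta

print(cbind(eta, mu, eta_back))
```

Other families (for example `binomial()` or `Gamma()`) carry their own default links in the same two components.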

Notational Conventions

The following notational conventions apply.

  • The number of explanatory variables is \(p\): \(x_1, x_2,\ldots, x_p\).
  • The number of regression parameters is \(p'\).
  • If the systematic component has a constant term \(\beta_0\), then \(p' = p + 1\). Otherwise \(p' = p\).
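The relationship between \(p\) and \(p'\) can be checked by counting the columns of a design matrix. This is a minimal sketch on a made-up data frame (`toy`, `x1`, `x2` are hypothetical names), with \(p = 2\) numeric explanatory variables.

```r
# Made-up data: two numeric explanatory variables, so p = 2.
toy <- data.frame(x1 = c(1.2, 3.4, 2.2, 5.1),
                  x2 = c(10, 25, 14, 40))

X_with <- model.matrix(~ x1 + x2, data = toy)         # includes a constant term
X_without <- model.matrix(~ x1 + x2 - 1, data = toy)  # constant term dropped

p_prime_with <- ncol(X_with)        # p' = p + 1
p_prime_without <- ncol(X_without)  # p' = p
print(c(p_prime_with, p_prime_without))
```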

Example: lime

Recall: lime contains data on 385 small-leaved lime trees grown in Russia.

  • \(y\), Foliage: the foliage biomass, in kg (oven dried matter)
  • \(x_1\), DBH: the tree diameter, at breast height, in cm
  • \(x_2\), Age: the age of the tree, in years
  • \(x_3, x_4\), Origin: the origin of the tree; one of Coppice, Natural, Planted

One potential linear regression model is: \[ \begin{cases} \text{var}[y_i] = \sigma^2 & \text{(random component)}\\ \mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} & \text{(systematic component)} \end{cases} \qquad(2)\]
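To see why the three-level factor Origin needs two variables \(x_3\) and \(x_4\), it may help to look at how R codes a factor by default. The rows below are invented and only borrow the lime column names; they are not the real data.

```r
# Invented rows that borrow the lime column names; these are not the real data.
toy <- data.frame(
  DBH = c(12.1, 20.5, 8.3, 15.0, 30.2, 18.7),
  Age = c(25, 40, 15, 30, 60, 35),
  Origin = factor(c("Coppice", "Natural", "Planted",
                    "Coppice", "Natural", "Planted"))
)

# Default treatment coding: the first level (Coppice) is the baseline,
# and each of the other two levels gets a 0/1 indicator column.
X <- model.matrix(~ DBH + Age + Origin, data = toy)
print(colnames(X))
print(X[, c("OriginNatural", "OriginPlanted")])
```

With the real data, a call along the lines of `lm(Foliage ~ DBH + Age + Origin, data = lime)` would presumably build the same kind of design matrix, assuming the columns are named as in the list above.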

Other possible systematic components?

\[ \begin{align} \mu &= \beta_0 &+ \beta_1 x_1 & & & \\ \mu &= \beta_0 & &+ \beta_2 x_2 &+ \beta_3 x_3 &+ \beta_4 x_4\\ \mu &= \beta_0 &+ \beta_1 x_1 &+ \beta_2 x_2 & \\ \mu &= & \beta_1 x_1 &+ \beta_2 x_2 x_3 &+ \beta_3 x_3\\ 1/\mu &= \beta_0 &+ \beta_1 x_1 &+ \beta_2 x_2 &\\ \log(\mu) &= \beta_0 &+ \beta_1 x_1 &+ \beta_2 x_2 &+ \beta_3 x_3\\ \mu &= \beta_0 &+ \beta_1 x_1^2 &+ \exp(\beta_2 x_2^4) &+ \beta_3 x_3 & + \beta_4 x_4 \end{align} \]

A GLM

Modeling Counts

In a Poisson process, we count the number of events per unit of time or space. We assume:

  • The events are independent (or nearly so)
  • The events are discrete
  • The number of events depends only on the length or size of the interval of time or space under consideration.

Examples?

The Poisson Probability Function

Definition 4 (Poisson Probability Function) Given the expected count \(\mu > 0\) for a Poisson process occurring over a unit of time/space, the probability that \(y\) events will occur in one of the units of time/space is given by the function

\[ \mathcal{P}(y; \mu) = \frac{\exp(-\mu) \mu^y}{y!}. \qquad(3)\]

dpois(7,10) # The probability of 7 tickets given 10 is typical
[1] 0.09007923
sum(dpois(0:7,lambda=10)) # The probability of 7 or fewer tickets given 10 is typical
[1] 0.2202206
ppois(7,10) # The probability of 7 or fewer tickets given 10 is typical
[1] 0.2202206
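
As a quick check, Equation 3 can be written out by hand and compared with `dpois()`. The values 7 and 10 are the ones used above; the variable names are only for illustration.

```r
mu <- 10   # expected count
y <- 7     # observed count

by_hand <- exp(-mu) * mu^y / factorial(y)  # Equation 3 written out
built_in <- dpois(y, lambda = mu)          # R's Poisson probability function

print(c(by_hand = by_hand, built_in = built_in))
all.equal(by_hand, built_in)
```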

Example

library(GLMsData)
data(nminer)
names(nminer)
[1] "Miners"  "Eucs"    "Area"    "Grazed"  "Shrubs"  "Bulokes" "Timber" 
[8] "Minerab"
plot(Minerab ~ Eucs, 
     data=nminer, 
     las=1, 
     ylim=c(0,20), 
     xlab="Number of eucalypts per 2 ha", 
     ylab="Number of noisy miners")

plot(jitter(Minerab) ~ Eucs, 
     data=nminer, 
     las=1, 
     ylim=c(0,20), 
     xlab="Number of eucalypts per 2 ha", 
     ylab="Number of noisy miners")

A possible Poisson GLM for this data is

\[ \begin{cases} y\sim \text{Pois}(\mu) & \text{(random component)}\\ \mu = \exp(\beta_0 + \beta_1 x) & \text{(systematic component)} \end{cases} \qquad(4)\]

where \(y\) is the number of noisy miners in a given 2 ha area, and \(x\) is the number of eucalypt trees in that same area.
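
A model of the form in Equation 4 can be fitted with `glm()`. The sketch below does not use the nminer data: it simulates counts from Equation 4 with hypothetical parameter values (`beta0`, `beta1`) and checks that the fit roughly recovers them.

```r
set.seed(203)  # arbitrary seed, for reproducibility

# Hypothetical "true" parameters, chosen only for illustration.
beta0 <- 0.5
beta1 <- 0.1

n <- 2000
x <- sample(0:30, size = n, replace = TRUE)  # simulated explanatory variable
mu <- exp(beta0 + beta1 * x)                 # systematic component
y <- rpois(n, lambda = mu)                   # random component

fit <- glm(y ~ x, family = poisson(link = "log"))
print(coef(fit))  # estimates should land near beta0 and beta1
```

For the nminer data loaded above, the analogous call would presumably be `glm(Minerab ~ Eucs, family = poisson, data = nminer)`.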

Interpreting Regression Models

Linear vs. Exponential

Question Given a one-unit change in \(x\), how should each of the two systematic components in Equation 5 be interpreted to describe the corresponding change in \(\mu\)?

\[ \begin{align} \mu &= \beta_0 + \beta_1 x\\ \log \mu &= \beta_0 + \beta_1 x \end{align} \qquad(5)\]
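
One way to explore the question numerically is to evaluate both systematic components at consecutive values of \(x\). The coefficients below are made up; the point is only to compare differences with ratios.

```r
# Made-up coefficients, used only to illustrate the two interpretations.
beta0 <- 1
beta1 <- 0.3
x <- 0:4

mu_linear <- beta0 + beta1 * x     # mu = beta0 + beta1 * x
mu_log <- exp(beta0 + beta1 * x)   # log(mu) = beta0 + beta1 * x

# Linear: each one-unit step in x adds beta1 to mu.
step_linear <- diff(mu_linear)
# Log link: each one-unit step in x multiplies mu by exp(beta1).
ratio_log <- mu_log[-1] / mu_log[-length(mu_log)]

print(step_linear)
print(ratio_log)
```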

Further Considerations

  • Be careful with multiple explanatory variables!
  • “[A]ll models are wrong…”
  • “…but some are useful. However, the approximate nature of the model must always be borne in mind.” –Box and Draper
  • Motivation
  • Accuracy vs Parsimony
  • Causality vs Association
  • Generalizability

If time

Choose a dataset from the GLMsData package and write a rough draft of a regression model. Justify your choice of systematic component.