Regression Models

Stat 203 Lecture 3

Dr. Janssen

What are they?

General Definition and Notation

Definition 1 A regression model is a statistical model that assumes the mean response \(\mu_i\) for observation \(i\) depends on the \(p\) explanatory variables \(x_{1i}, x_{2i}, \ldots, x_{pi}\) via some general function \(f\) through a number of regression parameters \(\beta_j\):

\[ E[y_i] = \mu_i = f(x_{1i}, \ldots,x_{pi}; \beta_0, \beta_1, \ldots, \beta_q). \]

More to the point (for this class):

\[ \mu_i = f(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}) \qquad(1)\]

Regression models with the form given in Equation 1 are said to be linear in the parameters. All the models we discuss this semester have this form.

The component \(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}\) is called the linear predictor.
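
As a minimal sketch of Equation 1, the linear predictor can be computed in R as a matrix product. The data values, the coefficients `beta`, and the two choices of \(f\) below are hypothetical, chosen only for illustration.

```r
# Hypothetical data: n = 4 observations, p = 2 explanatory variables
x1 <- c(1.0, 2.0, 3.0, 4.0)
x2 <- c(0.5, 0.1, 0.9, 0.3)

# Hypothetical regression parameters (beta_0, beta_1, beta_2)
beta <- c(0.2, 0.5, -1.0)

# Design matrix with a leading column of ones for the constant term
X <- cbind(1, x1, x2)

# Linear predictor: beta_0 + beta_1 * x1 + beta_2 * x2, one value per observation
eta <- drop(X %*% beta)

# Two possible choices of f in Equation 1: the identity and the exponential
mu_identity <- eta
mu_exp <- exp(eta)
```

Either way, the parameters enter only through `eta`, which is what "linear in the parameters" refers to.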

Two Special Types

Definition 2 (Linear Regression) The systematic component of a linear regression model assumes the form

\[ E[y_i] = \mu_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}, \] while the randomness is assumed to have constant variance \(\sigma^2\) about \(\mu_i\).

Definition 3 (Generalized Linear Model) The systematic component of a generalized linear model assumes the form

\[ \begin{align*} \mu_i &= g^{-1}(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi})\\ \text{or: } g(\mu_i) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}, \end{align*} \] where \(g\), called the link function, is a monotonic, differentiable function.

The randomness is given via a specific family of probability distributions, chosen based on the situation.
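
The two equivalent forms in Definition 3 can be checked numerically. The sketch below uses base R's `make.link()` with the log link; the values in `mu` are hypothetical.

```r
# The log link g(mu) = log(mu) and its inverse, as stored by base R
log_link <- make.link("log")

mu <- c(0.5, 2, 10)                 # hypothetical means (positive, as the log link requires)
eta <- log_link$linkfun(mu)         # g(mu): the mean mapped to the linear-predictor scale
mu_back <- log_link$linkinv(eta)    # g^{-1}(eta): mapped back to the mean scale

all.equal(mu, mu_back)              # the two forms in Definition 3 agree

# A family object pairs a distribution with a link; Poisson uses the log link by default
poisson()$link
```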

Notational Conventions

The following notational conventions apply.

The number of explanatory variables is \(p\): \(x_1, x_2,\ldots, x_p\).
The number of regression parameters is \(p'\).
If the systematic component has a constant term \(\beta_0\), then \(p' = p + 1\). Otherwise \(p' = p\).
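
One way to see these conventions in R is to count the columns of the design matrix. This is a sketch on a hypothetical data frame `toy` with \(p = 2\) explanatory variables.

```r
# Hypothetical data with p = 2 explanatory variables
toy <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(0.5, 0.1, 0.9, 0.3))
p <- ncol(toy)

X_with <- model.matrix(~ x1 + x2, data = toy)         # constant term included
X_without <- model.matrix(~ 0 + x1 + x2, data = toy)  # constant term removed

ncol(X_with)     # p' = p + 1
ncol(X_without)  # p' = p
```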

Example: `lime`

Recall: lime contains data on 385 small-leaved lime trees grown in Russia.

\(y\), Foliage: the foliage biomass, in kg (oven dried matter)
\(x_1\), DBH: the tree diameter, at breast height, in cm
\(x_2\), Age: the age of the tree, in years
\(x_3, x_4\), Origin: the origin of the tree; one of Coppice, Natural, Planted

One potential linear regression model is: \[ \begin{cases} \text{var}[y_i] = \sigma^2 & \text{(random component)}\\ \mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} & \text{(systematic component)} \end{cases} \qquad(2)\]
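
Origin has three levels but enters the model through two variables, \(x_3\) and \(x_4\). The sketch below shows how R's default (treatment) coding produces two indicator columns from a three-level factor. The data frame `toy` is hypothetical, not the real `lime` data, and which indicator corresponds to \(x_3\) or \(x_4\) is an assumption.

```r
# Hypothetical factor with the three Origin levels
toy <- data.frame(Origin = factor(c("Coppice", "Natural", "Planted", "Natural")))

# Default treatment coding: the first level (Coppice) is the baseline,
# and each remaining level gets its own 0/1 indicator column
X <- model.matrix(~ Origin, data = toy)

colnames(X)
X
```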

Other possible systematic components?

\[ \begin{align} \mu &= \beta_0 &+ \beta_1 x_1 & & & \\ \mu &= \beta_0 & &+ \beta_2 x_2 &+ \beta_3 x_3 &+ \beta_4 x_4\\ \mu &= \beta_0 &+ \beta_1 x_1 &+ \beta_2 x_2 & \\ \mu &= & \beta_1 x_1 &+ \beta_2 x_2 x_3 &+ \beta_3 x_3\\ 1/\mu &= \beta_0 &+ \beta_1 x_1 &+ \beta_2 x_2 &\\ \log(\mu) &= \beta_0 &+ \beta_1 x_1 &+ \beta_2 x_2 &+ \beta_3 x_3\\ \mu &= \beta_0 &+ \beta_1 x_1^2 &+ \exp(\beta_2 x_2^4) &+ \beta_3 x_3 & + \beta_4 x_4 \end{align} \]

A GLM

Modeling Counts

In a Poisson process, we count the number of events per unit of time or space. We assume:

The events are independent (or nearly so)
The events are discrete
The number of events depends only on the length or size of the time interval under consideration.
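
These assumptions can be explored by simulation. The sketch below builds event times from independent exponential gaps (a standard construction of a Poisson process with constant rate, used here as an assumption) and counts the events in each unit interval. The rate of 10 and the number of intervals are hypothetical.

```r
set.seed(203)

rate <- 10        # assumed expected number of events per unit of time
n_units <- 5000   # number of unit-length intervals to observe

# Independent exponential gaps between successive events
gaps <- rexp(2 * rate * n_units, rate = rate)
times <- cumsum(gaps)
times <- times[times < n_units]

# Count the events falling in each interval [0, 1), [1, 2), ...
counts <- tabulate(floor(times) + 1, nbins = n_units)

mean(counts)              # should be close to `rate`
mean(counts == 7)         # proportion of intervals with exactly 7 events
dpois(7, lambda = rate)   # Poisson probability, for comparison
```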

Examples?

The Poisson Probability Function

Definition 4 (Poisson Probability Function) Given the expected count \(\mu > 0\) for a Poisson process occurring over a unit of time/space, the probability that \(y\) events will occur in one of the units of time/space is given by the function

\[ \mathcal{P}(y; \mu) = \frac{\exp(-\mu) \mu^y}{y!}. \qquad(3)\]

dpois(7, 10) # P(y = 7): exactly 7 tickets when the expected count is 10

[1] 0.09007923

sum(dpois(0:7, lambda = 10)) # P(y <= 7): 7 or fewer tickets, summing the individual probabilities

[1] 0.2202206

ppois(7, 10) # P(y <= 7) again, using the cumulative distribution function

[1] 0.2202206
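
As a check, Equation 3 can also be evaluated by hand and compared with R's built-in functions. The variable names below are illustrative.

```r
mu <- 10   # expected count
y <- 7     # observed count of interest

# Equation 3 written out directly
by_hand <- exp(-mu) * mu^y / factorial(y)
all.equal(by_hand, dpois(y, mu))

# Cumulative probability P(Y <= 7) as a sum of Equation 3 over 0, 1, ..., 7
cumulative_by_hand <- sum(exp(-mu) * mu^(0:y) / factorial(0:y))
all.equal(cumulative_by_hand, ppois(y, mu))
```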

Example

library(GLMsData)
data(nminer)
names(nminer)

[1] "Miners"  "Eucs"    "Area"    "Grazed"  "Shrubs"  "Bulokes" "Timber" 
[8] "Minerab"

plot(Minerab ~ Eucs, 
     data=nminer, 
     las=1, 
     ylim=c(0,20), 
     xlab="Number of eucalypts per 2 ha", 
     ylab="Number of noisy miners")

plot(jitter(Minerab) ~ Eucs, 
     data=nminer, 
     las=1, 
     ylim=c(0,20), 
     xlab="Number of eucalypts per 2 ha", 
     ylab="Number of noisy miners")

A possible Poisson GLM for this data is

\[ \begin{cases} y\sim \text{Pois}(\mu) & \text{(random component)}\\ \mu = \exp(\beta_0 + \beta_1 x) & \text{(systematic component)} \end{cases} \qquad(4)\]

where \(y\) is the number of noisy miners observed at a given location, and \(x\) is the number of eucalypt trees (per 2 ha) at that location.
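
A model of the form in Equation 4 could be fitted with `glm()`. This is a sketch rather than a worked analysis: it assumes the GLMsData package is installed, and the value `Eucs = 10` used for prediction is hypothetical.

```r
library(GLMsData)
data(nminer)

# Random component: Poisson
# Systematic component: log(mu) = beta_0 + beta_1 * Eucs
fit <- glm(Minerab ~ Eucs, family = poisson(link = "log"), data = nminer)

coef(fit)   # estimates of beta_0 and beta_1

# Fitted mean count at a hypothetical location with 10 eucalypts
mu_hat <- predict(fit, newdata = data.frame(Eucs = 10), type = "response")
mu_hat
```

Setting `type = "response"` returns the prediction on the scale of \(\mu\) rather than on the scale of the linear predictor.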

Interpreting Regression Models

Linear vs. Exponential

Question Given a one-unit change in \(x\), how should each of the two systematic components in Equation 5 be interpreted to describe the corresponding change in \(\mu\)?

\[ \begin{align} \mu &= \beta_0 + \beta_1 x\\ \log \mu &= \beta_0 + \beta_1 x \end{align} \qquad(5)\]
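
One way to explore the question numerically is to evaluate both systematic components over a grid of \(x\) values. The coefficients below are hypothetical.

```r
beta0 <- 1     # hypothetical intercept
beta1 <- 0.3   # hypothetical slope
x <- 0:4

mu_linear <- beta0 + beta1 * x        # mu = beta_0 + beta_1 * x
mu_log <- exp(beta0 + beta1 * x)      # log(mu) = beta_0 + beta_1 * x

# Linear case: each one-unit step in x adds the same amount, beta1
step_linear <- diff(mu_linear)

# Log case: each one-unit step in x multiplies mu by the same factor, exp(beta1)
ratio_log <- mu_log[-1] / mu_log[-length(mu_log)]

step_linear
ratio_log
```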

Further Considerations

Be careful with multiple explanatory variables!
“[A]ll models are wrong…”
“…but some are useful. However, the approximate nature of the model must always be borne in mind.” –Box and Draper
Motivation
Accuracy vs Parsimony
Causality vs Association
Generalizability

A further consideration arises when you have multiple explanatory variables. They may interact! And a unit increase in one variable may not lead to a predictable increase in the response, as the other explanatory variable(s) may increase/decrease, etc.
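
A small sketch of why interactions complicate interpretation: with a product term in the systematic component, the change in \(\mu\) for a one-unit increase in \(x_1\) depends on the value of \(x_2\). The function `mu` and its coefficients are hypothetical.

```r
# Hypothetical systematic component with an interaction term:
# mu = beta_0 + beta_1 * x1 + beta_2 * x2 + beta_3 * x1 * x2
mu <- function(x1, x2, beta = c(1, 2, 0.5, -1.5)) {
  beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x1 * x2
}

# Effect of a one-unit increase in x1 at two different values of x2
effect_at_x2_0 <- mu(1, 0) - mu(0, 0)   # equals beta_1
effect_at_x2_2 <- mu(1, 2) - mu(0, 2)   # equals beta_1 + 2 * beta_3

effect_at_x2_0
effect_at_x2_2
```

With these coefficients the effect of \(x_1\) even changes sign between the two values of \(x_2\).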

Motivation: are we trying to make predictions, or to understand how variables relate to one another? The two aims could lead to different models.

Given a dataset, one could often choose one of many systematic components. Typically less complicated models are less accurate, but more accurate models could be hard to interpret and/or overfit. A typical aim is to tell “a persuasive story parsimoniously” (frugally).

Last two: how were our data collected? Via a randomized study, in which case we may control for confounding variables and establish causality? Via observation, and so we may only establish association? Randomly from a population? Or from a particular group at a particular time?

If time

Choose a dataset from the GLMsData package and write a rough draft of a regression model. Justify your choice of systematic component.
