Models for Proportions

Stat 203 Lecture 24

Dr. Janssen

Setup

Warmup

For the following research questions, describe the explanatory and response variables, and classify each type response as numerical, binomial, or a count.

What is the chance you are accepted into medical school given your GPA and MCAT scores?
Is a single mom more likely to marry the baby’s father if she has a boy?
Are the number of motorcycle deaths in a given year related to a state’s helmet laws?
A random selection of women aged 20-24 years are selected; can their shoe size be used to predict their heights?
Are students participating in sports in college more or less likely to graduate?
Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
Does the temperature affect the number of chirps per minute of a cricket?

Proportions are popular

Examples of responses that are proportions?

For these examples, a binomial distribution may be appropriate for modeling the responses:

The \(m\) individuals in each group are assumed independent
Each individual can be classified into one of two possible outcomes

Recall

The binomial probability function is

\[ \mathcal{P}(y; \mu, m) = \binom{m}{my}\mu^{my} (1-\mu)^{m(1-y)} \]

where:

\(m\) is known (group size)
\(\phi = 1\)
\(y = 0, 1/m, 2/m, \ldots, 1\)
expected proportion is \(0 < \mu < 1\)

Unit deviance

For the binomial distribution, the unit deviance is

\[ d(y,\mu) = 2 \left( y\log \frac{y}{\mu} + (1-y) \log \frac{1-y}{1-\mu} \right) \]

When \(y = 0, 1\) the limit form of the unit deviance is used.

Residual deviance

Interlude on estimating parameters (Ch. 6).

After computing the MLEs \(\hat{\beta}_j\) and corresponding fitted values \(\hat{\mu}\), the residual deviance is the minimized total deviance

\[ D(y,\hat{\mu}) = \sum\limits_{i=1}^n w_i d(y_i, \hat{\mu}_i). \]

For binomial GLMs:

\[ D(y,\hat{\mu}) = \sum_{i=1}^n m_i d(y_i,\hat{\mu}_i) \sim \chi_{n-p'}^2 \]

for a model with \(p'\) parameters in the linear predictor.

Example: `turbines`

library(GLMsData); data(turbines)
tur.m1 <- glm( Fissures/Turbines ~ Hours, family=binomial,
               weights=Turbines, data=turbines)
tur.m2 <- glm( cbind(Fissures, Turbines-Fissures) ~ Hours,
               family=binomial, data=turbines)

An experiment running turbines for various lengths of time [19, 20] recorded the proportion of turbine wheels \(y_i\) out of a total of \(m_i\) turbines developing fissures (narrow cracks).

Three ways to specify the response:

as the observed proportions \(y_i\), when the sample sizes \(m_i\) are specified as the weights in the glm() call (lines 2-3)
As a two-column array, the columns giving the numbers of successes and failures in each group of size \(m_i\). Weights do not need to be supplied (R will just add the successes and failures in each row) (lines 4-5; cbind makes a 2-column array in this case)
As a factor, where the first level corresponds to failures, and others to success; or as logicals. The prior weights do not need to be supplied. Useful if one row for each observation. (not showed here; the data would need to be arranged in \(\sum_{i=1}^n m_i = 432\) rows)

Link Functions

Logit

The logit (or logistic) link function is the canonical link for the binomial distribution and the default in R:

\[ \eta = \log \frac{\mu}{1-\mu} = \text{logit}(\mu). \]

A binomial GLM with the logit link function is often called a logistic regression model. It is specified in R using link ="logit".

Probit

The probit link function is

\[ \eta = \Phi^{-1}(\mu) = \text{probit}(\mu), \]

where \(\Phi\) is the CDF for the normal distribution. This link is specified in R as link="probit".

Complementary log-log

The complementary log-log link function is

\[ \eta = \log\left( - \log(1-\mu)\right). \]

This is specified in R as link="cloglog".

Example: `turbines`

library(GLMsData); data(turbines)
tr.logit  <- glm( Fissures/Turbines ~ Hours, data=turbines,
                   family=binomial, weights=Turbines)
tr.probit <- update( tr.logit, family=binomial(link="probit") )
tr.cll    <- update( tr.logit, family=binomial(link="cloglog") )

Example: `turbines`

library(GLMsData); data(turbines)
tr.logit  <- glm( Fissures/Turbines ~ Hours, data=turbines,
                   family=binomial, weights=Turbines)
tr.probit <- update( tr.logit, family=binomial(link="probit") )
tr.cll    <- update( tr.logit, family=binomial(link="cloglog") )
tr.array <- rbind( coef(tr.logit), coef(tr.probit), coef(tr.cll))
tr.array <- cbind( tr.array, c(deviance(tr.logit),
                     deviance(tr.probit), deviance(tr.cll)) )
colnames(tr.array) <- c("Intercept", "Hours","Residual dev.")
rownames(tr.array) <- c("Logit","Probit","Comp log-log")
tr.array

             Intercept        Hours Residual dev.
Logit        -3.923597 0.0009992372     10.331466
Probit       -2.275807 0.0005783211      9.814837
Comp log-log -3.603280 0.0008104936     12.227914

Example: `turbines`

Code

newHrs <- seq( 0, 5000, length=100)
newdf <- data.frame(Hours=newHrs)
newP.logit  <- predict( tr.logit,  newdata=newdf, type="response")
newP.probit <- predict( tr.probit, newdata=newdf, type="response")
newP.cll    <- predict( tr.cll,    newdata=newdf, type="response")
plot( Fissures/Turbines ~ Hours, data=turbines, pch=19, las=1,
        xlim=c(0, 5000), ylim=c(0, 0.7),
        xlab="Hours run", ylab="Proportion with fissures")
lines( newP.logit  ~ newHrs, lty=1, lwd=2)
lines( newP.probit ~ newHrs, lty=2, lwd=2)
lines( newP.cll    ~ newHrs, lty=4, lwd=2)
legend("topleft", lwd=2, lty=c(1, 2, 4),
         legend=c("Logit","Probit","Comp. log-log"))

Models for Proportions

Setup

Warmup

Proportions are popular

Recall

Unit deviance

Residual deviance

Example: turbines

Link Functions

Logit

Probit

Complementary log-log

Example: turbines

Example: turbines

Example: turbines

Example: `turbines`

Example: `turbines`

Example: `turbines`

Example: `turbines`