Models for Proportions

Stat 203 Lecture 24

Dr. Janssen

Setup

Warmup

For the following research questions, describe the explanatory and response variables, and classify each type response as numerical, binomial, or a count.

  • What is the chance you are accepted into medical school given your GPA and MCAT scores?
  • Is a single mom more likely to marry the baby’s father if she has a boy?
  • Are the number of motorcycle deaths in a given year related to a state’s helmet laws?
  • A random selection of women aged 20-24 years are selected; can their shoe size be used to predict their heights?
  • Are students participating in sports in college more or less likely to graduate?
  • Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
  • Does the temperature affect the number of chirps per minute of a cricket?

Recall

The binomial probability function is

\[ \mathcal{P}(y; \mu, m) = \binom{m}{my}\mu^{my} (1-\mu)^{m(1-y)} \]

where:

  • \(m\) is known (group size)
  • \(\phi = 1\)
  • \(y = 0, 1/m, 2/m, \ldots, 1\)
  • expected proportion is \(0 < \mu < 1\)

Unit deviance

For the binomial distribution, the unit deviance is

\[ d(y,\mu) = 2 \left( y\log \frac{y}{\mu} + (1-y) \log \frac{1-y}{1-\mu} \right) \]

When \(y = 0, 1\) the limit form of the unit deviance is used.

Residual deviance

Interlude on estimating parameters (Ch. 6).

After computing the MLEs \(\hat{\beta}_j\) and corresponding fitted values \(\hat{\mu}\), the residual deviance is the minimized total deviance

\[ D(y,\hat{\mu}) = \sum\limits_{i=1}^n w_i d(y_i, \hat{\mu}_i). \]

For binomial GLMs:

\[ D(y,\hat{\mu}) = \sum_{i=1}^n m_i d(y_i,\hat{\mu}_i) \sim \chi_{n-p'}^2 \]

for a model with \(p'\) parameters in the linear predictor.

Example: turbines

library(GLMsData); data(turbines)
tur.m1 <- glm( Fissures/Turbines ~ Hours, family=binomial,
               weights=Turbines, data=turbines)
tur.m2 <- glm( cbind(Fissures, Turbines-Fissures) ~ Hours,
               family=binomial, data=turbines)

Logit

The logit (or logistic) link function is the canonical link for the binomial distribution and the default in R:

\[ \eta = \log \frac{\mu}{1-\mu} = \text{logit}(\mu). \]

A binomial GLM with the logit link function is often called a logistic regression model. It is specified in R using link ="logit".

Probit

The probit link function is

\[ \eta = \Phi^{-1}(\mu) = \text{probit}(\mu), \]

where \(\Phi\) is the CDF for the normal distribution. This link is specified in R as link="probit".

Complementary log-log

The complementary log-log link function is

\[ \eta = \log\left( - \log(1-\mu)\right). \]

This is specified in R as link="cloglog".

Example: turbines

library(GLMsData); data(turbines)
tr.logit  <- glm( Fissures/Turbines ~ Hours, data=turbines,
                   family=binomial, weights=Turbines)
tr.probit <- update( tr.logit, family=binomial(link="probit") )
tr.cll    <- update( tr.logit, family=binomial(link="cloglog") )

Example: turbines

library(GLMsData); data(turbines)
tr.logit  <- glm( Fissures/Turbines ~ Hours, data=turbines,
                   family=binomial, weights=Turbines)
tr.probit <- update( tr.logit, family=binomial(link="probit") )
tr.cll    <- update( tr.logit, family=binomial(link="cloglog") )
tr.array <- rbind( coef(tr.logit), coef(tr.probit), coef(tr.cll))
tr.array <- cbind( tr.array, c(deviance(tr.logit),
                     deviance(tr.probit), deviance(tr.cll)) )
colnames(tr.array) <- c("Intercept", "Hours","Residual dev.")
rownames(tr.array) <- c("Logit","Probit","Comp log-log")
tr.array
             Intercept        Hours Residual dev.
Logit        -3.923597 0.0009992372     10.331466
Probit       -2.275807 0.0005783211      9.814837
Comp log-log -3.603280 0.0008104936     12.227914

Example: turbines

Code
newHrs <- seq( 0, 5000, length=100)
newdf <- data.frame(Hours=newHrs)
newP.logit  <- predict( tr.logit,  newdata=newdf, type="response")
newP.probit <- predict( tr.probit, newdata=newdf, type="response")
newP.cll    <- predict( tr.cll,    newdata=newdf, type="response")
plot( Fissures/Turbines ~ Hours, data=turbines, pch=19, las=1,
        xlim=c(0, 5000), ylim=c(0, 0.7),
        xlab="Hours run", ylab="Proportion with fissures")
lines( newP.logit  ~ newHrs, lty=1, lwd=2)
lines( newP.probit ~ newHrs, lty=2, lwd=2)
lines( newP.cll    ~ newHrs, lty=4, lwd=2)
legend("topleft", lwd=2, lty=c(1, 2, 4),
         legend=c("Logit","Probit","Comp. log-log"))