Stat 203 Lecture 27
For binomial data the variance function is \(V(\mu) = \mu(1-\mu)\), so a proportion \(y\) based on \(m\) trials has \(\text{var}[y] = \mu(1-\mu)/m\). In practice, the observed variation can exceed this quantity, even for binomial-like data.
This is called overdispersion.
Example: in normal linear regression, the MLE of \(\phi = \sigma^2\) is
\[ \hat{\sigma}^2 = \frac{1}{n} \sum\limits_{i=1}^n w_i (y_i - \hat{\mu}_i)^2, \]
which is rarely used in practice because it is biased downward: \(E[\hat{\sigma}^2] = \sigma^2 (n-p')/n\).
Instead:
\[ s^2 = \frac{1}{n-p'}\sum\limits_{i=1}^n w_i (y_i - \hat{\mu}_i)^2. \]
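As a quick check, a simulation sketch (the design matrix, sample sizes, and true \(\sigma^2\) are illustrative, with unit prior weights \(w_i = 1\)) shows the downward bias of the MLE and the unbiasedness of \(s^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_prime = 20, 3          # n observations, p' = number of coefficients (incl. intercept)
sigma2 = 4.0                # true dispersion phi = sigma^2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -1.0])

mle, s2 = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)   # weighted RSS with w_i = 1
    mle.append(rss / n)                     # biased MLE  sigma-hat^2
    s2.append(rss / (n - p_prime))          # unbiased    s^2

print(np.mean(mle))   # ≈ sigma2 * (n - p')/n = 3.4
print(np.mean(s2))    # ≈ 4.0
```

The averages match the bias formula \(E[\hat{\sigma}^2] = \sigma^2(n-p')/n\): the MLE is too small by the factor \((n-p')/n\), which \(s^2\) corrects.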
The profile log-likelihood for \(\phi\) is obtained by maximizing over the regression coefficients at each fixed value of \(\phi\):
\[ \ell(\phi) = \ell(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \phi; y). \]
The modified profile log-likelihood is
\[ \ell^0(\phi) = \frac{p'}{2} \log(\phi) + \ell(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \phi; y). \]
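For the normal linear model, this modification removes the bias noted above. Up to an additive constant,
\[ \ell^0(\phi) = \frac{p'}{2}\log(\phi) - \frac{n}{2}\log(\phi) - \frac{1}{2\phi}\sum\limits_{i=1}^n w_i (y_i - \hat{\mu}_i)^2, \]
and setting \(\partial \ell^0 / \partial \phi = 0\) gives
\[ -\frac{n-p'}{2\phi} + \frac{1}{2\phi^2}\sum\limits_{i=1}^n w_i (y_i - \hat{\mu}_i)^2 = 0 \quad\Longrightarrow\quad \hat{\phi} = \frac{1}{n-p'}\sum\limits_{i=1}^n w_i (y_i - \hat{\mu}_i)^2 = s^2, \]
so maximizing the modified profile log-likelihood recovers the unbiased estimator \(s^2\).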
Another approach is to use the mean deviance estimator of \(\phi\):
\[ \tilde{\phi} = \frac{D(y,\hat{\mu})}{n-p'}. \]
The Pearson statistic is the working residual sum of squares:
\[ X^2 = \sum\limits_{i=1}^n W_i (z_i - \hat{\eta}_i)^2 = \sum\limits_{i=1}^n \frac{w_i (y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}, \]
where \(z_i\) is the working response and \(W_i\) the working weight from the final IRLS iteration.
The Pearson estimator of \(\phi\) is then
\[ \overline{\phi} = \frac{X^2}{n-p'}. \]
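Both dispersion estimators can be computed directly from a fitted binomial GLM. A minimal numpy sketch (the IRLS fit and the simulated grouped data are illustrative, not from the notes; \(y_i\) are proportions with prior weights \(w_i = m_i\)):

```python
import numpy as np

def fit_logistic(X, y, m, n_iter=25):
    """IRLS for a binomial GLM with logit link; y are proportions, m the trial counts."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        W = m * mu * (1.0 - mu)                 # working weights
        z = eta + (y - mu) / (mu * (1.0 - mu))  # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# simulated grouped binary data: proportions y_i out of m_i trials
rng = np.random.default_rng(1)
n, m = 30, 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = 1 / (1 + np.exp(-(0.3 + 0.8 * X[:, 1])))
y = rng.binomial(m, mu_true) / m
mi = np.full(n, m)

beta_hat = fit_logistic(X, y, mi)
mu_hat = 1 / (1 + np.exp(-(X @ beta_hat)))
p_prime = X.shape[1]

# Pearson estimator: X^2 = sum w_i (y_i - mu_i)^2 / V(mu_i), with V(mu) = mu(1 - mu)
X2 = np.sum(mi * (y - mu_hat) ** 2 / (mu_hat * (1 - mu_hat)))
phi_pearson = X2 / (n - p_prime)

# mean deviance estimator: observations with y = 0 or y = 1 contribute only the defined term
with np.errstate(divide="ignore", invalid="ignore"):
    t1 = np.where(y > 0, y * np.log(y / mu_hat), 0.0)
    t2 = np.where(y < 1, (1 - y) * np.log((1 - y) / (1 - mu_hat)), 0.0)
D = 2 * np.sum(mi * (t1 + t2))
phi_dev = D / (n - p_prime)

print(phi_pearson, phi_dev)   # both near 1 when the binomial model holds
```

With genuinely binomial data, both estimators fluctuate around 1; values well above 1 signal overdispersion.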
A goodness-of-fit test compares the current model (Model A) to an alternative Model B of a particular type, typically the largest model that can be fitted to the data (the saturated model).
If the goodness-of-fit test rejects, this is evidence that the current model does not describe the data adequately.
The usual large-sample asymptotics do not apply, because the number of parameters in the saturated model grows with the number of observations.
Some rules-of-thumb for small-dispersion asymptotics are given on p. 277; in these cases, the Pearson statistic for goodness-of-fit is approximately chi-squared.
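For illustration (the deviance value and degrees of freedom below are hypothetical), a goodness-of-fit p-value can be computed against the \(\chi^2_{n-p'}\) reference distribution:

```python
from scipy.stats import chi2

# hypothetical fit: goodness-of-fit statistic D and residual degrees of freedom n - p'
D, df = 35.2, 27

# P(chi^2_df >= D): a small p-value is evidence against the current model
p_value = chi2.sf(D, df)
print(p_value)
```

The same computation applies with the Pearson statistic \(X^2\) in place of \(D\), subject to the small-dispersion rules-of-thumb above.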
One source of overdispersion: the probabilities \(\mu_i\) vary between observations even when all the explanatory variables are unchanged.
Alternatively, the \(m_i\) cases, of which observation \(y_i\) is a proportion, are not independent.
Example: positive cases arrive in clusters rather than as individual cases.
In that case, writing \(\rho\) for the correlation between the Bernoulli trials, we find \(\phi_i = 1 + (m_i-1)\rho\).
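This factor follows from the variance of a sum of exchangeable correlated Bernoulli trials: if \(Y_i = \sum_{j=1}^{m_i} B_j\) with \(\text{var}[B_j] = \mu_i(1-\mu_i)\) and \(\text{corr}[B_j, B_k] = \rho\) for \(j \neq k\), then
\[ \text{var}[Y_i] = m_i\,\mu_i(1-\mu_i) + m_i(m_i-1)\,\rho\,\mu_i(1-\mu_i) = \{1 + (m_i-1)\rho\}\, m_i\,\mu_i(1-\mu_i), \]
so the nominal binomial variance is inflated by exactly \(\phi_i = 1 + (m_i-1)\rho\).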