Comparing Models

Stat 203 Lecture 8

Author

Dr. Janssen

The Problem

When we’re handed a dataset, building a model is fairly straightforward. Building the best model is harder–how do you compare two distinct models? How do you compare several (perhaps, many)? This is the question we’ll explore here.

Two nested models

Definition 1 We say linear regression model A is nested in linear regression model B if model A can be obtained from model B by setting one or more parameters in model B to zero.

As we have seen, increasing the number of explanatory variables does tend to increase the explanatory power of the model, but at the cost of additional complexity from the reduction in the degrees of freedom. How can we tell if this increased explanatory power, described by a reduction in the RSS, is statistically significant?

First, we calculate:

  • The difference in degrees of freedom from model A to model B, \(\Delta\text{df}\)
  • The sum-of-squares, which is the reduction in RSS from A to B: \(\text{SS} = \text{RSS}(A) - \text{RSS}(B)\)
  • The RSS of \(B\) and the degrees of freedom of \(B\), \(\text{RSS}(B)\) and \(\text{df}(B)\)

Then, we calculate an \(F\)-statistic:

\[ F = \frac{\text{SS}/\Delta\text{df}}{\text{RSS}(B)/\text{df}(B)}. \]

We then compare \(F\) to an \(F\)-distribution with \((\Delta\text{df},\text{df}(B))\) degrees of freedom, using, say, pf(). If the \(p\)-value suggests significance, then model B is a significant improvement over A. If we have created two linear models model_A and model_B in R, we can use anova() to do the analysis: anova(model_A, model_B).

Exploration

You recently built a model to predict the quantity of insurance that a person may want to buy. During that process, you made some choices about which variable(s) to include. Explore two nested models you considered use the process above to determine whether the tradeoff of increased complexity of the larger model was warranted given the significance of the calculated \(F\)-statistic. Justify your conclusion.

Sequential analysis of variance

More generally, we may wish to compare a sequence of nested models. This is most easily illustrated via an example.

library(GLMsData); data(lungcap);

LC.0 <- lm(log(FEV) ~ 1, data=lungcap) # no explanatory variables
LC.1 <- update(LC.0, . ~ . + Age)      # Age
LC.2 <- update(LC.1, . ~ . + Ht)       # Age and Height
LC.3 <- update(LC.2, . ~ . + Gender)   # Age, Height, and Gender
LC.4 <- update(LC.3, . ~ . + Smoke)    # Then, add smoking status

A brief word on the syntax: the update() function takes as its first argument the model to be updated, and then the model specification to change. The first dot indicates that the left-hand side should stay the same, while the second dot indicates that we should keep the right-hand side from the model we’re updating, before adding the next explanatory variable.

We then calculate an anova() table using R:

anova(LC.4)
Analysis of Variance Table

Response: log(FEV)
           Df Sum Sq Mean Sq   F value    Pr(>F)    
Age         1 43.210  43.210 2041.9564 < 2.2e-16 ***
Ht          1 15.326  15.326  724.2665 < 2.2e-16 ***
Gender      1  0.153   0.153    7.2451  0.007293 ** 
Smoke       1  0.103   0.103    4.8537  0.027937 *  
Residuals 649 13.734   0.021                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output of the ANOVA table is the result of a series of sequential tests performed. The last one formally tests is Smoke is significant in the model, given that Age, Ht, and Gender are already in the model. In other words, the \(F\)-test adjusts for the previously-added variables. This means that the order in which the variables are added to the model matters! It may be that a variable is significant when added first, but not when added later and adjusted for the variables already present.

Exploration

Compare a sequence of at least four nested models on the Term dataset. What do you observe? What do you wonder?

Parallel and Independent Regression Models

Parallel Models

Many (most?) of the models we explore will contain dummy variables representing some factor. We can use these variables to produce parallel systematic components which are different based on the factor level.

Example 1 In the lungcap data, we can produce a model for \(\mu = \log(y)\) based on height:

\[ \hat{\mu} = -2.271 + 0.05212 x_2. \]

We might wonder if smoking status impacts the relationship between height and lung capacity. We could add a variable to the model:

\[ \hat{\mu} = -2.277 + 0.05222 x_2 - 0.006830 x_4. \]

Alternatively, we can let \(x_4 = 1\) to produce a systematic component for smokers, and let \(x_4 = 0\) produce a systematic component for non-smokers. (Note we make no changes to the assumed random component of the model.)

We get the following systematic components:

\[ \hat\mu = \begin{cases} -2.277 + 0.05222 x_2 \text{ for non-smokers}\\ -2.284 + 0.05222 x_2 \text{ for smokers}. \end{cases} \]

Note that the slopes are the same, but the intercepts are (slightly) different. Are they different enough to warrant the choice of parallel regression lines? How could we check?

Well, note that regardless of the value we choose for \(x_4\), if the coefficient on \(x_4\) is zero, the intercepts for the two different lines are equal, so we need a formal test of the significance of \(\beta_4\), which we do as follows:

printCoefmat(coef(summary(lm( log(FEV) ~ Ht + Smoke, data=lungcap))))
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -2.2767801  0.0656677 -34.6712   <2e-16 ***
Ht           0.0522196  0.0010785  48.4174   <2e-16 ***
Smoke       -0.0068303  0.0205450  -0.3325   0.7397    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus the evidence suggests that different intercepts are not needed.

Exploration

Choose a factor in the Term dataset and create a parallel model. Is there evidence that the parallel regression models are needed?

Question

Why does it make sense to create parallel models based on values of categorical explanatory variables and not quantitative explanatory variables?

Independent Models

In addition to adjusting the intercepts based on values of factors, we can explore adjustments to the slopes via interactions of factors and covariates. In the lungcap dataset, this could look like:

\[ \hat\mu = -2.281 + 0.05230 x_2 - 0.002294 x_4 + 0.002294 x_2.x_4, \tag{1}\]

where \(x_2.x_4\) denotes the interaction between height and smoking status. We can use Equation 1 to produce two separate systematic components via the two factor levels of \(x_4\) (\(=0\) for non-smokers, and \(=1\) for smokers):

\[ \hat\mu = \begin{cases} -2.281 + 0.05230x_2 \text{ for non-smokers}\\ -2.137 + 0.05000x_2 \text{ for smokers}.\end{cases} \]

We thus see that the lines are no longer parallel: the slopes and intercepts have been adjusted. One can create this model in R:

LC.model <- lm( log(FEV) ~ Ht + Smoke + Ht:Smoke, data=lungcap)
Exploration

If you have not already done so, explore interactions between explanatory variables in your model of the Term data. Are there interactions which should be included?

Question

Does it make sense to include the interaction between two variables without each of the variables themselves being an independent part of the model? For instance, would it make sense to include the \(x_2 . x_4\) term in Equation 1 without independnet terms for \(x_2\) or \(x_4\)? Explain.

Comparing Non-nested models

We have thus far explored comparing nested models, that is, models which could be considered special case(s) of one another. Often we want to compare models which are not so closely related. Recall that there are two main criteria for model selection: accuracy and parsimony. We measure accuracy via RSS (lower is better), though the inclusion of additional explanatory variables will never cause RSS to increase, and almost always causes it to decrease. At the same time, including more explanatory variables does cause an increase in complexity. How can we balance these competing desires?

Definition 2 (Akaike’s Information Criterion (AIC)) For a normal linear regression model \(p'\) parameters,

\[ \text{AIC} = n\log(\text{RSS}/n) + 2p' \tag{2}\]

The term \(2p'\) in Equation 2 is called the penalty, as it penalizes the more complex linear regression models by a facdtor of \(k = 2\).

Note that the value of AIC is not meaningful; it’s only useful for comparing models against one another. Similar quantities are also defined, such as the Bayesian Information Criterion (BIC):

\[ \text{BIC} = n\log(\text{RSS}/n) + p' \log n. \tag{3}\]

Question

For a fixed \(n \gg 0\), which penalizes an additional parameter more: AIC or BIC? What if \(n\) is small?

You can calculate both AIC and BIC in R via the extractAIC() function. The AIC is returned by default, whereas the BIC can be returned by specifying k=log(nobs(fit)), where fit is the name of the fitted model and nobs() extracts the number of observations used to fit the model.

Additional Tools for Model Selection

The text explores additional tools for model selection in Section 2.12. You can read about them there.

Case Study

Explore the ants dataset in the GLMsData package. Explore different (nested and non-nested) models, and compare them with the tools we discussed above. Be prepared to present your model and explain how you determined which variables and interactions to include.

Appendix: Exploring the Term dataset

In what follows, we’ll explore the term dataset. About this dataset (source):

Like all firms, life insurance companies continually seek new ways to deliver products to the market. Those involved in product development wish to know “who buys insurance and how much do they buy?” Analysts can readily get information on characteristics of current customers through company databases. Potential customers, those that do not have insurance with the company, are often the main focus for expanding market share. We examine the Survey of Consumer Finances (SCF), a nationally representative sample that contains extensive information on assets, liabilities, in- come, and demographic characteristics of those sampled (potential U.S. customers). We study a random sample of 500 households with positive incomes that were interviewed in the 2004 survey. For term life insurance, the quantity of insurance is measured by the policy FACE, the amount that the company will pay in the event of the death of the named insured.

Term <- read.csv("https://prof.mkjanssen.org/glm/data/TermLife.csv", header=TRUE) # import the data
#  PICK THE SUBSET OF THE DATA CORRESPONDING TO TERM PURCHASE
Term2 <- subset(Term, subset=FACE > 0)

The variables are:

  • GENDER: The gender of the survey respondendent
  • AGE: Age of the survey respondent
  • MARSTAT: Marital status of the survey respondent (=1 if married, =2 if living with partner, and =0 otherwise)
  • EDUCATION: Number of years of education of the survey respondent
  • ETHNICITY: Ethnicity
  • SMARSTAT: Marital status of the respondent’s spouse
  • SGENDER: Gender of the respondent’s spouse
  • SAGE: Age of the respondent’s spouse
  • SEDUCATION: Education of the respondent’s spouse
  • NUMHH: Number of household members
  • INCOME: Annual income of the familiy
  • CHARITY: Charitable contributions
  • FACE: Amount that the company will pay in the event of the death of the named insured
  • FACECVLIFEPOLICIES: Face amount of life insurance policy with a cash value
  • CASHCVLIFEPOLICIES: Cash value of life insurance policy with a cash value
  • BORROWCVLIFEPOL: Amount borrowed on life insurance policy with a cash value
  • NETVALUE: Net amount at risk on life insurance policy with a cash value

Your job: work with up to one other person to develop a multiple linear regression model to predict the amount of coverage a customer may buy. Make sure you do proper exploratory data analysis (you may find you want to do a log-transform on one or more variables!). Identify significance levels for your parameter estimates, as well as confidence intervals for the parameters and fitted value for representative customers. You should also explain the proportion of the variation predicted by your model compared to the variation attributed to the random component.