```
library(GLMsData); data(lungcap)
LC.0 <- lm(log(FEV) ~ 1, data=lungcap)  # no explanatory variables
LC.1 <- update(LC.0, . ~ . + Age)       # Age
LC.2 <- update(LC.1, . ~ . + Ht)        # Age and Height
LC.3 <- update(LC.2, . ~ . + Gender)    # Age, Height, and Gender
LC.4 <- update(LC.3, . ~ . + Smoke)     # then, add smoking status
```

# Comparing Models

Stat 203 Lecture 8

# The Problem

When we’re handed a dataset, building a model is fairly straightforward. Building the *best* model is harder: how do you compare two distinct models? How do you compare several (perhaps many)? This is the question we’ll explore here.

## Two nested models

**Definition 1.** We say linear regression model A is *nested* in linear regression model B if model A can be obtained from model B by setting one or more parameters in model B to zero.

As we have seen, increasing the number of explanatory variables does tend to increase the explanatory power of the model, but at the cost of additional complexity from the reduction in the degrees of freedom. How can we tell if this increased explanatory power, described by a reduction in the RSS, is statistically significant?

First, we calculate:

- The difference in degrees of freedom from model A to model B, \(\Delta\text{df}\)
- The sum-of-squares, which is the reduction in RSS from A to B: \(\text{SS} = \text{RSS}(A) - \text{RSS}(B)\)
- The RSS of \(B\) and the degrees of freedom of \(B\), \(\text{RSS}(B)\) and \(\text{df}(B)\)

Then, we calculate an \(F\)-statistic:

\[ F = \frac{\text{SS}/\Delta\text{df}}{\text{RSS}(B)/\text{df}(B)}. \]

We then compare \(F\) to an \(F\)-distribution with \((\Delta\text{df},\text{df}(B))\) degrees of freedom using, say, `pf()`. If the \(p\)-value suggests significance, then model B is a significant improvement over A. If we have created two linear models `model_A` and `model_B` in `R`, we can use `anova()` to do the analysis: `anova(model_A, model_B)`.
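As a sanity check, the calculation can be carried out by hand and compared against `anova()`. The sketch below uses simulated data; the variable names and the simulated model are illustrative, not from the lecture:

```r
## F-test by hand on two nested models (simulated, illustrative data).
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

model_A <- lm(y ~ x1)        # smaller (nested) model
model_B <- lm(y ~ x1 + x2)   # larger model

SS       <- deviance(model_A) - deviance(model_B)  # reduction in RSS
delta_df <- df.residual(model_A) - df.residual(model_B)
Fstat    <- (SS / delta_df) / (deviance(model_B) / df.residual(model_B))
p_value  <- pf(Fstat, delta_df, df.residual(model_B), lower.tail = FALSE)

## anova() reproduces the same F-statistic and p-value:
anova(model_A, model_B)
```

Note that `deviance()` applied to an `lm` fit returns the RSS, so the hand computation matches the formula above term by term.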

## Exploration

You recently built a model to predict the quantity of insurance that a person may want to buy. During that process, you made some choices about which variable(s) to include. Explore two nested models you considered, using the process above to determine whether the increased complexity of the larger model was warranted, given the significance of the calculated \(F\)-statistic. Justify your conclusion.

## Sequential analysis of variance

More generally, we may wish to compare a *sequence* of nested models, such as the models `LC.0` through `LC.4` fit at the start of these notes.

A brief word on the syntax: the `update()` function takes as its first argument the model to be updated, and then the change to the model specification. The first dot indicates that the left-hand side should stay the same, while the second dot indicates that we should keep the right-hand side from the model we’re updating, before adding the next explanatory variable.
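A minimal self-contained illustration of this syntax, using R’s built-in `mtcars` data (the model names here are arbitrary):

```r
## Build a nested sequence with update(): each call keeps the previous
## formula (the two dots) and appends one explanatory variable.
fit0 <- lm(mpg ~ 1, data = mtcars)   # intercept-only
fit1 <- update(fit0, . ~ . + wt)     # mpg ~ wt
fit2 <- update(fit1, . ~ . + hp)     # mpg ~ wt + hp
formula(fit2)
```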

We then calculate an `anova()` table in `R` via `anova(LC.4)`:

```
Analysis of Variance Table

Response: log(FEV)
           Df Sum Sq Mean Sq   F value    Pr(>F)
Age         1 43.210  43.210 2041.9564 < 2.2e-16 ***
Ht          1 15.326  15.326  724.2665 < 2.2e-16 ***
Gender      1  0.153   0.153    7.2451  0.007293 **
Smoke       1  0.103   0.103    4.8537  0.027937 *
Residuals 649 13.734   0.021
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The ANOVA table is the result of a series of sequential tests. The *last* row formally tests whether `Smoke` is significant in the model, given that `Age`, `Ht`, and `Gender` are already in the model. In other words, each \(F\)-test adjusts for the previously-added variables. This means that the order in which the variables are added to the model matters! It may be that a variable is significant when added first, but not when added later and adjusted for the variables already present.
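One can see this order-dependence directly with two correlated predictors. This sketch uses the built-in `mtcars` data rather than `lungcap`:

```r
## The sequential Sum Sq credited to wt depends on when it enters,
## because wt and hp are correlated.
wt_first <- anova(lm(mpg ~ wt + hp, data = mtcars))["wt", "Sum Sq"]
wt_last  <- anova(lm(mpg ~ hp + wt, data = mtcars))["wt", "Sum Sq"]
c(wt_first = wt_first, wt_last = wt_last)  # wt_first is much larger
```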

## Exploration

Compare a sequence of at least four nested models on the `Term` dataset. What do you observe? What do you wonder?

## Parallel and Independent Regression Models

### Parallel Models

Many (most?) of the models we explore will contain dummy variables representing some factor. We can use these variables to produce parallel systematic components that differ by factor level.

**Example 1.** In the `lungcap` data, we can produce a model for \(\mu\), the mean of \(\log(\text{FEV})\), based on height \(x_2\):

\[ \hat{\mu} = -2.271 + 0.05212 x_2. \]

We might wonder if smoking status impacts the relationship between height and lung capacity. We could add a variable to the model:

\[ \hat{\mu} = -2.277 + 0.05222 x_2 - 0.006830 x_4. \]

In this model, setting \(x_4 = 1\) produces a systematic component for smokers, while \(x_4 = 0\) produces one for non-smokers. (Note we make no changes to the assumed random component of the model.)

We get the following systematic components:

\[ \hat\mu = \begin{cases} -2.277 + 0.05222 x_2 \text{ for non-smokers}\\ -2.284 + 0.05222 x_2 \text{ for smokers}. \end{cases} \]

Note that the slopes are the same, but the intercepts are (slightly) different. Are they different *enough* to warrant the choice of parallel regression lines? How could we check?

Well, note that regardless of the value we choose for \(x_4\), if the coefficient on \(x_4\) is zero, then the intercepts of the two lines are equal. So we need a formal test of the significance of \(\beta_4\), which we do as follows:

`printCoefmat(coef(summary(lm( log(FEV) ~ Ht + Smoke, data=lungcap))))`

```
              Estimate Std. Error  t value Pr(>|t|)
(Intercept) -2.2767801  0.0656677 -34.6712   <2e-16 ***
Ht           0.0522196  0.0010785  48.4174   <2e-16 ***
Smoke       -0.0068303  0.0205450  -0.3325   0.7397
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Thus the evidence suggests that different intercepts are not needed.
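The same parallel-lines construction can be sketched with the built-in `mtcars` data (an illustrative stand-in; the factor here is `am` rather than smoking status):

```r
## Common slope on wt; the factor shifts the intercept only.
fit <- lm(mpg ~ wt + factor(am), data = mtcars)
b   <- coef(fit)   # (Intercept), wt, factor(am)1
## am = 0 line: b[1] + b[2] * wt
## am = 1 line: (b[1] + b[3]) + b[2] * wt   (same slope, shifted intercept)
```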

### Independent Models

In addition to adjusting the intercepts based on the values of factors, we can explore adjustments to the slopes via interactions of factors and covariates. In the `lungcap` dataset, this could look like:

\[ \hat\mu = -2.281 + 0.05230 x_2 + 0.1440 x_4 - 0.002294 x_2 x_4, \tag{1}\]

where \(x_2 x_4\) denotes the interaction between height and smoking status. We can use Equation 1 to produce two separate systematic components via the two factor levels of \(x_4\) (\(=0\) for non-smokers, and \(=1\) for smokers):

\[ \hat\mu = \begin{cases} -2.281 + 0.05230x_2 \text{ for non-smokers}\\ -2.137 + 0.05000x_2 \text{ for smokers}.\end{cases} \]

We thus see that the lines are no longer parallel: the slopes *and* the intercepts have been adjusted. One can create this model in `R`:

`LC.model <- lm(log(FEV) ~ Ht + Smoke + Ht:Smoke, data=lungcap)`
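Since the parallel (shifted-intercepts) model is nested in the interaction model, the \(F\)-test from earlier applies directly. A self-contained sketch with the built-in `mtcars` data:

```r
## Does the factor need its own slope? Compare nested fits with anova().
parallel <- lm(mpg ~ wt + factor(am), data = mtcars)  # shifted intercepts only
separate <- lm(mpg ~ wt * factor(am), data = mtcars)  # adds wt:factor(am)
anova(parallel, separate)  # a small p-value favors separate slopes
```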

# Comparing Non-nested models

We have thus far explored comparing *nested* models, that is, models which could be considered special case(s) of one another. Often we want to compare models which are not so closely related. Recall that there are two main criteria for model selection: accuracy and parsimony. We measure accuracy via RSS (lower is better), though the inclusion of additional explanatory variables will never cause RSS to increase, and almost always causes it to decrease. At the same time, including more explanatory variables *does* cause an increase in complexity. How can we balance these competing desires?

**Definition 2 (Akaike’s Information Criterion (AIC)).** For a normal linear regression model with \(p'\) parameters,

\[ \text{AIC} = n\log(\text{RSS}/n) + 2p' \tag{2}\]

The term \(2p'\) in Equation 2 is called the *penalty*, as it penalizes more complex linear regression models at a rate of \(k = 2\) per parameter.

Note that the value of AIC is not meaningful; it’s only useful for comparing models against one another. Similar quantities are also defined, such as the Bayesian Information Criterion (BIC):

\[ \text{BIC} = n\log(\text{RSS}/n) + p' \log n. \tag{3}\]

You can calculate both AIC and BIC in `R` via the `extractAIC()` function. The AIC is returned by default, whereas the BIC can be returned by specifying `k=log(nobs(fit))`, where `fit` is the name of the fitted model and `nobs()` extracts the number of observations used to fit the model.
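Both formulas can be checked against `extractAIC()` directly. This sketch uses the built-in `mtcars` data:

```r
## AIC and BIC by hand (Equations 2 and 3) versus extractAIC().
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
rss <- deviance(fit)              # residual sum of squares
p   <- length(coef(fit))          # p' = number of regression parameters
aic <- n * log(rss / n) + 2 * p
bic <- n * log(rss / n) + p * log(n)
extractAIC(fit)                   # returns c(edf, AIC); second entry matches aic
extractAIC(fit, k = log(n))       # second entry matches bic
```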

# Additional Tools for Model Selection

The text explores additional tools for model selection in Section 2.12. You can read about them there.

# Case Study

Explore the `ants`

dataset in the `GLMsData`

package. Explore different (nested and non-nested) models, and compare them with the tools we discussed above. Be prepared to present your model and explain how you determined which variables and interactions to include.

# Appendix: Exploring the `Term` dataset

In what follows, we’ll explore the `Term` dataset. About this dataset (source):

Like all firms, life insurance companies continually seek new ways to deliver products to the market. Those involved in product development wish to know “who buys insurance and how much do they buy?” Analysts can readily get information on characteristics of current customers through company databases. Potential customers, those that do not have insurance with the company, are often the main focus for expanding market share. We examine the Survey of Consumer Finances (SCF), a nationally representative sample that contains extensive information on assets, liabilities, income, and demographic characteristics of those sampled (potential U.S. customers). We study a random sample of 500 households with positive incomes that were interviewed in the 2004 survey. For term life insurance, the quantity of insurance is measured by the policy `FACE`, the amount that the company will pay in the event of the death of the named insured.

```
Term  <- read.csv("https://prof.mkjanssen.org/glm/data/TermLife.csv", header=TRUE) # import the data
# PICK THE SUBSET OF THE DATA CORRESPONDING TO TERM PURCHASE
Term2 <- subset(Term, subset=FACE > 0)
```

The variables are:

- `GENDER`: Gender of the survey respondent
- `AGE`: Age of the survey respondent
- `MARSTAT`: Marital status of the survey respondent (=1 if married, =2 if living with partner, and =0 otherwise)
- `EDUCATION`: Number of years of education of the survey respondent
- `ETHNICITY`: Ethnicity
- `SMARSTAT`: Marital status of the respondent’s spouse
- `SGENDER`: Gender of the respondent’s spouse
- `SAGE`: Age of the respondent’s spouse
- `SEDUCATION`: Education of the respondent’s spouse
- `NUMHH`: Number of household members
- `INCOME`: Annual income of the family
- `CHARITY`: Charitable contributions
- `FACE`: Amount that the company will pay in the event of the death of the named insured
- `FACECVLIFEPOLICIES`: Face amount of life insurance policy with a cash value
- `CASHCVLIFEPOLICIES`: Cash value of life insurance policy with a cash value
- `BORROWCVLIFEPOL`: Amount borrowed on life insurance policy with a cash value
- `NETVALUE`: Net amount at risk on life insurance policy with a cash value

**Your job**: work with up to one other person to develop a multiple linear regression model to predict the amount of coverage a customer may buy. Make sure you do proper exploratory data analysis (you may find you want to log-transform one or more variables!). Identify significance levels for your parameter estimates, as well as confidence intervals for the parameters and fitted values for representative customers. You should also explain the proportion of the variation predicted by your model compared to the variation attributed to the random component.