library(GLMsData); data(lungcap);
LC.0 <- lm(log(FEV) ~ 1, data=lungcap) # no explanatory variables
LC.1 <- update(LC.0, . ~ . + Age) # Age
LC.2 <- update(LC.1, . ~ . + Ht) # Age and Height
LC.3 <- update(LC.2, . ~ . + Gender) # Age, Height, and Gender
LC.4 <- update(LC.3, . ~ . + Smoke) # Age, Height, Gender, and smoking status
Comparing Models
Stat 203 Lecture 8
The Problem
When we’re handed a dataset, building a model is fairly straightforward. Building the best model is harder: how do you compare two distinct models? How do you compare several (perhaps many)? This is the question we’ll explore here.
Two nested models
Definition 1 We say linear regression model A is nested in linear regression model B if model A can be obtained from model B by setting one or more parameters in model B to zero.
As we have seen, increasing the number of explanatory variables does tend to increase the explanatory power of the model, but at the cost of additional complexity: each added variable reduces the residual degrees of freedom. How can we tell if this increased explanatory power, described by a reduction in the RSS, is statistically significant?
First, we calculate:
- The difference in degrees of freedom from model A to model B, \(\Delta\text{df}\)
- The sum-of-squares, which is the reduction in RSS from A to B: \(\text{SS} = \text{RSS}(A) - \text{RSS}(B)\)
- The RSS of \(B\) and the degrees of freedom of \(B\), \(\text{RSS}(B)\) and \(\text{df}(B)\)
Then, we calculate an \(F\)-statistic:
\[ F = \frac{\text{SS}/\Delta\text{df}}{\text{RSS}(B)/\text{df}(B)}. \]
We then compare \(F\) to an \(F\)-distribution with \((\Delta\text{df},\text{df}(B))\) degrees of freedom, using, say, pf(). If the \(p\)-value suggests significance, then model B is a significant improvement over A. If we have created two linear models model_A and model_B in R, we can use anova() to do the analysis: anova(model_A, model_B).
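As a sketch of the calculation above (using the built-in mtcars data rather than lungcap, so the code is self-contained), we can compute the \(F\)-statistic by hand and confirm that anova() reports the same values:

```r
# Two nested models: model_A (mpg ~ wt) is nested in model_B (mpg ~ wt + hp)
model_A <- lm(mpg ~ wt, data = mtcars)
model_B <- lm(mpg ~ wt + hp, data = mtcars)

RSS_A <- sum(resid(model_A)^2)
RSS_B <- sum(resid(model_B)^2)
delta_df <- df.residual(model_A) - df.residual(model_B)  # number of parameters added
SS <- RSS_A - RSS_B                                      # reduction in RSS from A to B

Fstat   <- (SS / delta_df) / (RSS_B / df.residual(model_B))
p_value <- pf(Fstat, delta_df, df.residual(model_B), lower.tail = FALSE)

# anova() performs the same test and reports the same F and p-value
anova(model_A, model_B)
```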
Exploration
You recently built a model to predict the quantity of insurance that a person may want to buy. During that process, you made some choices about which variable(s) to include. Explore two nested models you considered, using the process above to determine whether the tradeoff of increased complexity in the larger model was warranted, given the significance of the calculated \(F\)-statistic. Justify your conclusion.
Sequential analysis of variance
More generally, we may wish to compare a sequence of nested models. This is most easily illustrated via an example.
A brief word on the syntax: the update() function takes as its first argument the model to be updated, and then the change to make to the model specification. The first dot indicates that the left-hand side should stay the same, while the second dot indicates that we should keep the right-hand side from the model we’re updating, before adding the next explanatory variable.
We then calculate an anova() table using R:
anova(LC.4)
Analysis of Variance Table
Response: log(FEV)
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 43.210 43.210 2041.9564 < 2.2e-16 ***
Ht 1 15.326 15.326 724.2665 < 2.2e-16 ***
Gender 1 0.153 0.153 7.2451 0.007293 **
Smoke 1 0.103 0.103 4.8537 0.027937 *
Residuals 649 13.734 0.021
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table is the result of a series of sequential tests. The last one formally tests whether Smoke is significant in the model, given that Age, Ht, and Gender are already in the model. In other words, each \(F\)-test adjusts for the previously-added variables. This means that the order in which the variables are added to the model matters! It may be that a variable is significant when added first, but not when added later, after adjusting for the variables already present.
Exploration
Compare a sequence of at least four nested models on the Term dataset. What do you observe? What do you wonder?
Parallel and Independent Regression Models
Parallel Models
Many (most?) of the models we explore will contain dummy variables representing some factor. We can use these variables to produce parallel systematic components which are different based on the factor level.
Example 1 In the lungcap data, we can produce a model for \(\mu = \operatorname{E}[\log(y)]\) based on height:
\[ \hat{\mu} = -2.271 + 0.05212 x_2. \]
We might wonder if smoking status impacts the relationship between height and lung capacity. We could add a variable to the model:
\[ \hat{\mu} = -2.277 + 0.05222 x_2 - 0.006830 x_4. \]
Alternatively, we can let \(x_4 = 1\) to produce a systematic component for smokers, and let \(x_4 = 0\) produce a systematic component for non-smokers. (Note we make no changes to the assumed random component of the model.)
We get the following systematic components:
\[ \hat\mu = \begin{cases} -2.277 + 0.05222 x_2 \text{ for non-smokers}\\ -2.284 + 0.05222 x_2 \text{ for smokers}. \end{cases} \]
Note that the slopes are the same, but the intercepts are (slightly) different. Are they different enough to warrant the choice of parallel regression lines? How could we check?
Well, note that regardless of the value we choose for \(x_4\), if the coefficient on \(x_4\) is zero, the intercepts for the two different lines are equal, so we need a formal test of the significance of \(\beta_4\), which we do as follows:
printCoefmat(coef(summary(lm( log(FEV) ~ Ht + Smoke, data=lungcap))))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.2767801 0.0656677 -34.6712 <2e-16 ***
Ht 0.0522196 0.0010785 48.4174 <2e-16 ***
Smoke -0.0068303 0.0205450 -0.3325 0.7397
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus the evidence suggests that different intercepts are not needed.
Independent Models
In addition to adjusting the intercepts based on values of factors, we can explore adjustments to the slopes via interactions of factors and covariates. In the lungcap dataset, this could look like:
\[ \hat\mu = -2.281 + 0.05230 x_2 + 0.1441 x_4 - 0.002294 x_2 x_4, \tag{1}\]
where \(x_2 x_4\) denotes the interaction between height and smoking status. We can use Equation 1 to produce two separate systematic components via the two factor levels of \(x_4\) (\(=0\) for non-smokers, and \(=1\) for smokers):
\[ \hat\mu = \begin{cases} -2.281 + 0.05230x_2 \text{ for non-smokers}\\ -2.137 + 0.05000x_2 \text{ for smokers}.\end{cases} \]
We thus see that the lines are no longer parallel: the slopes and intercepts have both been adjusted. One can create this model in R:
LC.model <- lm( log(FEV) ~ Ht + Smoke + Ht:Smoke, data=lungcap)
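The group-specific intercepts and slopes can be read off the fitted coefficients. Here is a sketch of the arithmetic using the built-in mtcars data, with the binary am variable playing the role of the Smoke dummy:

```r
# An interaction between a covariate and a dummy yields two non-parallel lines
fit <- lm(mpg ~ wt * am, data = mtcars)  # wt + am + wt:am
b <- coef(fit)                           # (Intercept), wt, am, wt:am

# Group with dummy = 0: baseline intercept and slope
line_0 <- c(intercept = unname(b["(Intercept)"]), slope = unname(b["wt"]))
# Group with dummy = 1: both intercept and slope are adjusted
line_1 <- c(intercept = unname(b["(Intercept)"] + b["am"]),
            slope     = unname(b["wt"] + b["wt:am"]))
line_0
line_1
```

The same arithmetic applied to coef(LC.model) recovers the two systematic components for non-smokers and smokers shown above.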
Comparing Non-nested models
We have thus far explored comparing nested models, that is, models which could be considered special case(s) of one another. Often we want to compare models which are not so closely related. Recall that there are two main criteria for model selection: accuracy and parsimony. We measure accuracy via RSS (lower is better), though the inclusion of additional explanatory variables will never cause RSS to increase, and almost always causes it to decrease. At the same time, including more explanatory variables does cause an increase in complexity. How can we balance these competing desires?
Definition 2 (Akaike’s Information Criterion (AIC)) For a normal linear regression model with \(p'\) parameters,
\[ \text{AIC} = n\log(\text{RSS}/n) + 2p' \tag{2}\]
The term \(2p'\) in Equation 2 is called the penalty, as it penalizes more complex linear regression models by a factor of \(k = 2\).
Note that the value of AIC is not meaningful; it’s only useful for comparing models against one another. Similar quantities are also defined, such as the Bayesian Information Criterion (BIC):
\[ \text{BIC} = n\log(\text{RSS}/n) + p' \log n. \tag{3}\]
You can calculate both AIC and BIC in R via the extractAIC() function. The AIC is returned by default, whereas the BIC can be obtained by specifying k=log(nobs(fit)), where fit is the name of the fitted model and nobs() extracts the number of observations used to fit the model.
Additional Tools for Model Selection
The text explores additional tools for model selection in Section 2.12. You can read about them there.
Case Study
Explore the ants dataset in the GLMsData package. Explore different (nested and non-nested) models, and compare them with the tools we discussed above. Be prepared to present your model and explain how you determined which variables and interactions to include.
Appendix: Exploring the Term dataset
In what follows, we’ll explore the Term dataset. About this dataset (source):
Like all firms, life insurance companies continually seek new ways to deliver products to the market. Those involved in product development wish to know “who buys insurance and how much do they buy?” Analysts can readily get information on characteristics of current customers through company databases. Potential customers, those that do not have insurance with the company, are often the main focus for expanding market share. We examine the Survey of Consumer Finances (SCF), a nationally representative sample that contains extensive information on assets, liabilities, income, and demographic characteristics of those sampled (potential U.S. customers). We study a random sample of 500 households with positive incomes that were interviewed in the 2004 survey. For term life insurance, the quantity of insurance is measured by the policy FACE, the amount that the company will pay in the event of the death of the named insured.
Term <- read.csv("https://prof.mkjanssen.org/glm/data/TermLife.csv", header=TRUE) # import the data
Term2 <- subset(Term, subset=FACE > 0) # pick the subset of the data corresponding to term purchase
The variables are:
- GENDER: The gender of the survey respondent
- AGE: Age of the survey respondent
- MARSTAT: Marital status of the survey respondent (=1 if married, =2 if living with partner, and =0 otherwise)
- EDUCATION: Number of years of education of the survey respondent
- ETHNICITY: Ethnicity
- SMARSTAT: Marital status of the respondent’s spouse
- SGENDER: Gender of the respondent’s spouse
- SAGE: Age of the respondent’s spouse
- SEDUCATION: Education of the respondent’s spouse
- NUMHH: Number of household members
- INCOME: Annual income of the family
- CHARITY: Charitable contributions
- FACE: Amount that the company will pay in the event of the death of the named insured
- FACECVLIFEPOLICIES: Face amount of life insurance policy with a cash value
- CASHCVLIFEPOLICIES: Cash value of life insurance policy with a cash value
- BORROWCVLIFEPOL: Amount borrowed on life insurance policy with a cash value
- NETVALUE: Net amount at risk on life insurance policy with a cash value
Your job: work with up to one other person to develop a multiple linear regression model to predict the amount of coverage a customer may buy. Make sure you do proper exploratory data analysis (you may find you want to do a log-transform on one or more variables!). Identify significance levels for your parameter estimates, as well as confidence intervals for the parameters and fitted value for representative customers. You should also explain the proportion of the variation predicted by your model compared to the variation attributed to the random component.
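One possible starting point for this workflow is sketched below; the particular explanatory variables and the representative customer are illustrative choices only, not the required model, and the code assumes the TermLife.csv URL above is reachable.

```r
# A sketch of the requested workflow (illustrative variable choices only)
Term  <- read.csv("https://prof.mkjanssen.org/glm/data/TermLife.csv", header = TRUE)
Term2 <- subset(Term, subset = FACE > 0)

# Log-transform the skewed dollar amounts before fitting
fit <- lm(log(FACE) ~ log(INCOME) + EDUCATION + NUMHH, data = Term2)

summary(fit)            # parameter estimates and significance levels
confint(fit)            # confidence intervals for the parameters
summary(fit)$r.squared  # proportion of variation explained by the model

# Fitted value (with a confidence interval) for a representative customer
predict(fit,
        newdata = data.frame(INCOME = 60000, EDUCATION = 16, NUMHH = 3),
        interval = "confidence")
```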