Course Intro

Stat 203: Generalized Linear Models

Dr. Janssen

About this course

About Dr. Janssen

  • Tenth year at Dordt
  • Ph.D. in math from Nebraska
  • Family: Laura, Lila (7), Sam (4), Arthur (🐈)
  • Interests: running, board games

Summer 2023 at Custer State Park

Introduction

  • Focus on modeling
  • GLM: framework that unifies many types of regression
  • Hands-on work in R
  • Interactive lecture-style class
  • Required text: Generalized Linear Models with Examples in R (Dunn and Smyth)

Types of Work

  • Weekly homework, typed and submitted to Canvas as an R Markdown file (due Fridays)
  • Two midterm exams: October 4, November 10 (take-home possible)
  • Final exam: Monday, December 18, 1:15-3:15pm
  • Applied project
  • Canvas tour

Set up R

Open your computers and fire up R. Then install the following packages (instructions on p. 505 of the text):

  • GLMsData
  • MASS
  • statmod
  • tweedie

Then load each package: check the box next to its name in the Packages pane, or type, e.g., library(GLMsData) in the console.
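
If you prefer the console to the Packages pane, the installation step can be scripted. The following is a minimal sketch, not the procedure from p. 505 of the text. The helper name not_installed is made up for illustration, and the install.packages() call is left commented out because it needs an internet connection and a CRAN mirror.

```r
# Packages listed on this slide
course_pkgs <- c("GLMsData", "MASS", "statmod", "tweedie")

# Return the subset of `pkgs` that is not yet installed
# (hypothetical helper, not part of any package)
not_installed <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

not_installed(course_pkgs)   # Names of the packages still to install

# Install whatever is missing (uncomment to run):
# install.packages(not_installed(course_pkgs))
```

Once a package is installed, library(GLMsData) loads it for the current session.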

Section 1.2: Describing Data

Example: Describing data

library(GLMsData)   # Load GLMsData package (if not loaded already)
data(lungcap)       # Make the dataset lungcap available for use
head(lungcap)       # Show the first few lines of lungcap
  Age   FEV Ht Gender Smoke
1   3 1.072 46      F     0
2   4 0.839 48      F     0
3   4 1.102 48      F     0
4   4 1.389 48      F     0
5   4 1.577 49      F     0
6   4 1.418 49      F     0
head(lungcap$Age)   # Show the first six values of Age
[1] 3 4 4 4 4 4
tail(lungcap$Gender) # Show the last six values of Gender
[1] M M M M M M
Levels: F M
length(lungcap$Age) # Number of values of Age
[1] 654
dim(lungcap)        # Number of rows and columns of lungcap
[1] 654   5
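
These functions are related: dim() returns the number of rows and then the number of columns, nrow() and ncol() give the two separately, and the length of any one column equals the number of rows. A small sketch, rebuilt by hand from the six rows printed by head(lungcap) so that it runs without GLMsData (the name lc6 is made up here):

```r
# First six rows of lungcap, typed in from the head() output above
lc6 <- data.frame(
  Age    = c(3L, 4L, 4L, 4L, 4L, 4L),
  FEV    = c(1.072, 0.839, 1.102, 1.389, 1.577, 1.418),
  Ht     = c(46, 48, 48, 48, 49, 49),
  Gender = factor(rep("F", 6), levels = c("F", "M")),
  Smoke  = rep(0L, 6)
)

dim(lc6)          # Rows, then columns
nrow(lc6)         # Number of rows (observations)
ncol(lc6)         # Number of columns (variables)
names(lc6)        # Variable names
length(lc6$Age)   # Length of one column; equals nrow(lc6)
```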

Talking about data

  • \(n\) denotes the number of observations in the dataset; for lungcap, \(n = 654\)
  • We use \(y\) to denote the response; \(y_i\) refers to the \(i\)th value of the response
  • We typically use \(x\)’s to denote explanatory variables: \(x_1\) is the first explanatory variable, \(x_{1,1}\) the first value of the first explanatory variable, etc.
  • Factors are explanatory variables that are qualitative, like gender
  • Covariates are explanatory variables that are quantitative
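
To connect the notation to R, here is a sketch that uses the six rows printed by head(lungcap). Treating FEV as the response and Age as the first explanatory variable is an assumption made for illustration, and the names lc6, y, x1, and n are made up here.

```r
# First six rows of lungcap, typed in from the head() output
lc6 <- data.frame(
  Age    = c(3L, 4L, 4L, 4L, 4L, 4L),
  FEV    = c(1.072, 0.839, 1.102, 1.389, 1.577, 1.418),
  Ht     = c(46, 48, 48, 48, 49, 49),
  Gender = factor(rep("F", 6), levels = c("F", "M")),
  Smoke  = rep(0L, 6)
)

y  <- lc6$FEV     # Response (assumed here to be FEV)
x1 <- lc6$Age     # First explanatory variable (assumed here to be Age)
n  <- length(y)   # n: number of observations (6 here; 654 in the full dataset)

y[2]              # y_2: the second value of the response
x1[1]             # x_{1,1}: the first value of the first explanatory variable

is.factor(lc6$Gender)   # TRUE: Gender is qualitative, so it is a factor
is.numeric(lc6$Ht)      # TRUE: Ht is quantitative, so it is a covariate
```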

Other means of exploring datasets

str(lungcap)
'data.frame':   654 obs. of  5 variables:
 $ Age   : int  3 4 4 4 4 4 4 5 5 5 ...
 $ FEV   : num  1.072 0.839 1.102 1.389 1.577 ...
 $ Ht    : num  46 48 48 48 49 49 50 46.5 49 49 ...
 $ Gender: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Smoke : int  0 0 0 0 0 0 0 0 0 0 ...
summary(lungcap)
      Age              FEV              Ht        Gender      Smoke        
 Min.   : 3.000   Min.   :0.791   Min.   :46.00   F:318   Min.   :0.00000  
 1st Qu.: 8.000   1st Qu.:1.981   1st Qu.:57.00   M:336   1st Qu.:0.00000  
 Median :10.000   Median :2.547   Median :61.50           Median :0.00000  
 Mean   : 9.931   Mean   :2.637   Mean   :61.14           Mean   :0.09939  
 3rd Qu.:12.000   3rd Qu.:3.119   3rd Qu.:65.50           3rd Qu.:0.00000  
 Max.   :19.000   Max.   :5.793   Max.   :74.00           Max.   :1.00000  

On Smoke

The variable Smoke is qualitative but is stored as 0 or 1, so R treats it as a number. We can make explicit that Smoke is a factor as follows:

lungcap$Smoke <- factor(lungcap$Smoke,
                  levels=c(0, 1),                  # The values of Smoke
                  labels=c("Non-smoker","Smoker")) # The labels
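
One way to see what the conversion changes: for a 0/1 variable, the mean is the proportion of 1s, so the mean of 0.09939 reported by summary(lungcap) is consistent with 65 smokers among the 654 observations (65/654 ≈ 0.09939). The sketch below builds a stand-in vector with those counts rather than loading lungcap. The counts are inferred from the summary output, and the names smoke01 and smoke are made up here.

```r
# Stand-in for lungcap$Smoke: 589 zeros and 65 ones (counts inferred, see above)
smoke01 <- rep(c(0, 1), times = c(589, 65))
round(mean(smoke01), 5)   # Proportion of 1s in the 0/1 coding

# Same conversion as on the slide, applied to the stand-in vector
smoke <- factor(smoke01,
                levels = c(0, 1),
                labels = c("Non-smoker", "Smoker"))

levels(smoke)   # The two labels
table(smoke)    # Counts per level, in place of a numeric summary
```

After the conversion, summary() reports a count for each level rather than quartiles and a mean.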

Exploration

Explore the following datasets in the GLMsData package:

  • punting
  • lime
  • dyouth
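
The same four calls used on lungcap (head, dim, str, summary) work for each of these datasets. A small sketch of a helper that bundles them; the name explore is made up here and is not part of GLMsData.

```r
# Print the standard first-look summaries of a data frame
# (hypothetical helper, for illustration only)
explore <- function(df) {
  print(head(df))      # First six rows
  print(dim(df))       # Number of rows and columns
  str(df)              # Type and first few values of each variable
  print(summary(df))   # Numeric summaries or level counts per variable
  invisible(df)        # Return the data frame without printing it again
}

# Usage, once GLMsData is loaded:
# data(punting); explore(punting)
```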

punting

data(punting)
head(punting)
  Left Right   Punt
1  170   170 162.50
2  130   140 144.00
3  170   180 174.50
4  160   160 163.50
5  150   170 192.00
6  150   150 171.75
dim(punting)
[1] 13  3
str(punting)
'data.frame':   13 obs. of  3 variables:
 $ Left : int  170 130 170 160 150 150 180 110 110 120 ...
 $ Right: int  170 140 180 160 170 150 170 110 120 130 ...
 $ Punt : num  162 144 174 164 192 ...
summary(punting)
      Left           Right            Punt      
 Min.   :110.0   Min.   :110.0   Min.   :104.8  
 1st Qu.:130.0   1st Qu.:130.0   1st Qu.:140.2  
 Median :150.0   Median :150.0   Median :162.0  
 Mean   :143.8   Mean   :147.7   Mean   :150.3  
 3rd Qu.:160.0   3rd Qu.:170.0   3rd Qu.:165.2  
 Max.   :180.0   Max.   :180.0   Max.   :192.0  

lime

data(lime)
head(lime)
  Foliage  DBH Age  Origin
1     0.1  4.0  38 Natural
2     0.2  6.0  38 Natural
3     0.4  8.0  46 Natural
4     0.6  9.6  44 Natural
5     0.6 11.3  60 Natural
6     0.8 13.7  56 Natural
dim(lime)
[1] 385   4
str(lime)
'data.frame':   385 obs. of  4 variables:
 $ Foliage: num  0.1 0.2 0.4 0.6 0.6 0.8 1 1.4 1.7 3.5 ...
 $ DBH    : num  4 6 8 9.6 11.3 13.7 15.4 17.8 18 22 ...
 $ Age    : int  38 38 46 44 60 56 72 74 68 79 ...
 $ Origin : Factor w/ 3 levels "Coppice","Natural",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(lime)
    Foliage            DBH             Age             Origin   
 Min.   : 0.010   Min.   : 1.60   Min.   : 10.00   Coppice:133  
 1st Qu.: 0.440   1st Qu.:10.20   1st Qu.: 32.00   Natural:185  
 Median : 1.060   Median :15.80   Median : 46.00   Planted: 67  
 Mean   : 1.872   Mean   :16.33   Mean   : 49.55                
 3rd Qu.: 2.340   3rd Qu.:21.70   3rd Qu.: 66.00                
 Max.   :14.080   Max.   :38.90   Max.   :141.00                

dyouth

data(dyouth)
head(dyouth)
  Obs   Age Group Gender Depression
1  79 12-14    LD      M          L
2  18 12-14    LD      M          H
3  34 12-14    LD      F          L
4  14 12-14    LD      F          H
5  14 12-14   SED      M          L
6   5 12-14   SED      M          H
dim(dyouth)
[1] 24  5
str(dyouth)
'data.frame':   24 obs. of  5 variables:
 $ Obs       : int  79 18 34 14 14 5 5 8 63 10 ...
 $ Age       : Factor w/ 3 levels "12-14","15-16",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ Group     : Factor w/ 2 levels "LD","SED": 1 1 1 1 2 2 2 2 1 1 ...
 $ Gender    : Factor w/ 2 levels "F","M": 2 2 1 1 2 2 1 1 2 2 ...
 $ Depression: Factor w/ 2 levels "H","L": 2 1 2 1 2 1 2 1 2 1 ...
summary(dyouth)
      Obs           Age    Group    Gender Depression
 Min.   : 1.00   12-14:8   LD :12   F:12   H:12      
 1st Qu.: 6.50   15-16:8   SED:12   M:12   L:12      
 Median :13.50   17-18:8                             
 Mean   :19.38                                       
 3rd Qu.:27.50                                       
 Max.   :79.00