Plotting Data/Two Components of a Statistical Model

Stat 203 Lecture 2

Dr. Janssen

Plotting Data

Getting Started

Let’s look at how to create some plots in R.

library(GLMsData)
data(lungcap)
plot( FEV ~ Age, data=lungcap, 
       xlab="Age (in years)",  # The x-axis label
       ylab="FEV (in L)",      # The y-axis label
       main="FEV vs age",      # The main title
       xlim=c(0, 20),          # Explicitly set x-axis limits
       ylim=c(0, 6),           # Explicitly set y-axis limits
       las=1)                  # Makes axis labels horizontal

FEV vs Ht

plot( FEV ~ Ht, data=lungcap, main="FEV vs height",
       xlab="Height (in inches)", ylab="FEV (in L)",
       las=1, ylim=c(0, 6) )

FEV vs Gender

plot( FEV ~ Gender, data=lungcap,
       main="FEV vs gender", ylab="FEV (in L)",
       las=1, ylim=c(0, 6))

FEV vs Smoke

# Recode the 0/1 Smoke indicator as a labelled factor
lungcap$Smoke <- factor(lungcap$Smoke,
                  levels=c(0, 1),
                  labels=c("Non-smoker","Smoker"))
plot( FEV ~ Smoke,  data=lungcap, main="FEV vs Smoking status",
       ylab="FEV (in L)", xlab="Smoking status",
       las=1, ylim=c(0, 6))

FEV vs Age and FEV vs Ht

FEV vs Age

plot( FEV ~ Age,
    data=subset(lungcap, Smoke=="Smoker"),  # Only select smokers
    main="FEV vs age\nfor smokers",         # \n means `new line'
    ylab="FEV (in L)", xlab="Age (in years)",
    ylim=c(0, 6), xlim=c(0, 20), las=1)

plot( FEV ~ Age,
    data=subset(lungcap, Smoke=="Non-smoker"),  # Only select non-smokers
    main="FEV vs age\nfor non-smokers",
    ylab="FEV (in L)", xlab="Age (in years)",
    ylim=c(0, 6), xlim=c(0, 20), las=1)

FEV vs Height

plot( FEV ~ Ht, data=subset(lungcap, Smoke=="Smoker"),
    main="FEV vs height\nfor smokers",
    ylab="FEV (in L)", xlab="Height (in inches)",
    xlim=c(45, 75), ylim=c(0, 6), las=1)

plot( FEV ~ Ht, data=subset(lungcap, Smoke=="Non-smoker"),
    main="FEV vs height\nfor non-smokers",
    ylab="FEV (in L)",  xlab="Height (in inches)",
    xlim=c(45, 75), ylim=c(0, 6), las=1)

Explore!

There are other plots in the text on pp. 8-9; take a look at them, and then explore one of the datasets from last time:

  • punting
  • lime
  • dyouth

lime

data(lime)
plot(Foliage ~ DBH, data=subset(lime, Origin=="Planted"),
     xlab="Diameter (breast height, in cm)",
     ylab="Foliage (biomass, in kg)",
     main="Foliage vs Diameter")

boxplot(lime$Foliage ~ lime$Origin,
        xlab="Origin",
        ylab="Foliage (biomass, in kg)",
        main="Foliage vs. Origin")

plot(lime$Foliage ~ lime$Age,
     xlab="Age of the tree (in years)",
     ylab="Foliage (biomass, in kg)",
     main="Foliage vs Age"
     )

boxplot(lime$Foliage ~ lime$Origin,
        xlab="Origin",
        ylab="Foliage (biomass, in kg)",
        main="Foliage vs Origin",
        las=1)

Coding for Factors

Mathematizing Factors

  • How can we incorporate factors in a statistical model?
  • By coding them.

Example

head(lungcap$Gender)
[1] F F F F F F
Levels: F M
contrasts(lungcap$Gender)
  M
F 0
M 1
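
To see how this coding would enter a model, the sketch below builds a design matrix with `model.matrix()`. It uses a small made-up factor called `gender` (a hypothetical stand-in, not the `lungcap` data) so that it runs without any packages.

```r
# Made-up factor for illustration only; these are not lungcap values
gender <- factor(c("F", "M", "M", "F"))

# Default treatment coding: the first level (F) is the reference, coded 0
contrasts(gender)

# The design matrix has an intercept column plus one 0/1 dummy column for M
X <- model.matrix(~ gender)
X
```

In a model containing only this factor, the coefficient on the dummy column is the difference in mean response between level M and the reference level F.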

Two Parts of a Statistical Model

The Random and Systematic Components

Definition 1. For a given combination of explanatory variables, a model for the distribution of a recorded response variable is called the random component.

Definition 2. The systematic component of a model is the mathematical relationship between the mean of the response and the explanatory variables.

Example

Consider the lime dataset. A potential systematic component is

\[ \mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} +\beta_4 x_{4i}, \qquad(1)\]

where \(\mu_i = E[y_i]\) is the expected value of \(y_i\), and the \(\beta_j\)’s are unknown regression parameters.

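The explanatory variables are the \(x_{ji}\)’s, built from DBH, Age, and Origin. Because Origin is a factor with three levels, it is coded by two dummy variables, which is why three variables give four \(x\)’s.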
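
One way to see where four \(x\)’s can come from three variables is to look at the design matrix R would build. The sketch below uses a made-up data frame `toy` whose column names mimic `lime`; the numbers are invented for illustration, and the three Origin levels are assumed to be Coppice, Natural, and Planted.

```r
# Invented rows that mimic the layout of lime; these are NOT real measurements
toy <- data.frame(
  DBH    = c(10, 15, 20, 25, 30, 35),
  Age    = c(20, 30, 40, 50, 60, 70),
  Origin = factor(c("Coppice", "Natural", "Planted",
                    "Coppice", "Natural", "Planted"))
)

# One column per regression parameter beta_0, ..., beta_4:
# intercept, DBH, Age, and two dummy variables for the three-level factor
X <- model.matrix(~ DBH + Age + Origin, data = toy)
colnames(X)
```

The reference level (here Coppice, the first level alphabetically) gets no column of its own; it is absorbed into the intercept.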

The random component can take many forms; if we assume \(y_i \sim N(\mu_i,\sigma^2)\), we are assuming that the \(y_i\)’s are normally distributed about \(\mu_i\) with constant variance \(\sigma^2\).
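
The two components can be illustrated together by simulation. This is a minimal sketch with one explanatory variable and hypothetical parameter values (`beta0`, `beta1`, `sigma` are chosen arbitrarily, not estimated from any dataset in the lecture).

```r
set.seed(1)                      # make the simulation reproducible

# Hypothetical parameter values, chosen only for illustration
beta0 <- 2
beta1 <- 0.5
sigma <- 1

x  <- seq(1, 10, length.out = 200)
mu <- beta0 + beta1 * x                        # systematic component
y  <- rnorm(length(x), mean = mu, sd = sigma)  # random component: y_i ~ N(mu_i, sigma^2)

# Fit a straight line to the simulated data to estimate beta0 and beta1
fit <- lm(y ~ x)
coef(fit)
```

The fitted coefficients should land close to the values used to generate the data, with the scatter around the line governed by `sigma`.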

Major Goal

In this class, we’ll explore:

  • Linear regression models
  • Generalized linear models
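
As a small preview of how the two are related, the sketch below fits the same straight line with `lm()` and with `glm()` using a normal (gaussian) family and identity link. It uses R's built-in `cars` data, which is an assumption for illustration and not a dataset from this lecture.

```r
# Built-in dataset (speed and stopping distance); not from GLMsData
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, data = cars, family = gaussian(link = "identity"))

# The estimated coefficients agree: a normal linear regression
# is a special case of a generalized linear model
all.equal(coef(fit_lm), coef(fit_glm))
```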