Ling 105: session 02: linear regr. 1 (key)

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

08 Apr 2024 12-34-56

1 Introduction

Let’s load a data set for a few linear modeling examples; the data are in _input/rts.csv and you can find information about the variables/columns in _input/rts.r.

rm(list=ls(all.names=TRUE))
library(car); library(effects); library(magrittr)
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/rts.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE            RT             LENGTH         LANGUAGE         GROUP
 Min.   :   1   Min.   : 271.0   Min.   :3.000   english:4023   english :3961
 1st Qu.:2150   1st Qu.: 505.0   1st Qu.:4.000   spanish:3977   heritage:4039
 Median :4310   Median : 595.0   Median :5.000
 Mean   :4303   Mean   : 661.5   Mean   :5.198
 3rd Qu.:6450   3rd Qu.: 732.0   3rd Qu.:6.000
 Max.   :8610   Max.   :4130.0   Max.   :9.000
                NA's   :248
      CORRECT        FREQPMW           ZIPFFREQ     SITE
 correct  :7749   Min.   :   1.00   Min.   :3.000   a:2403
 incorrect: 251   1st Qu.:  17.00   1st Qu.:4.230   b:2815
                  Median :  42.00   Median :4.623   c:2782
                  Mean   :  81.14   Mean   :4.591
                  3rd Qu.: 101.00   3rd Qu.:5.004
                  Max.   :1152.00   Max.   :6.061
                                                            

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.

2 Deviance

A useful thing to consider is deviance, which we will here define as the overall amount of variability in the response variable. As we talked about before in class, the response variable has a mean, but the response variable’s values are not all the mean, they vary around it:

summary(d$RT)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  271.0   505.0   595.0   661.5   732.0  4130.0     248 

One way to express the degree to which the values of RT vary around the mean would be the average absolute deviation:

mean(abs(d$RT - mean(d$RT, na.rm=TRUE)), na.rm=TRUE)
[1] 172.5416

But as you know, this is not how we usually quantify deviations from the mean: As in the standard deviation, we usually square the differences between the observed values and the mean and then add them up to get the so-called deviance:

sum((d$RT - mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE)
[1] 536471586

This number quantifies how much all data points together deviate from the mean. Another useful way to look at this would be to already use a linear model to quantify this variability of the values around the mean. We can do this by fitting a so-called null model, i.e. a model with no predictors:

summary(m_00 <- lm(       # make/summarize the linear model m_00:
   RT ~ 1,                # RT ~ an overall intercept (1) & no predictor
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT ~ 1, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-390.5 -156.5  -66.5   70.5 3468.5

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  661.469      2.988   221.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 263.1 on 7751 degrees of freedom
  (248 observations deleted due to missingness)

That model’s intercept is the overall mean of the response variable and that model’s deviance is the same as what we computed manually just above:

deviance(m_00)
[1] 536471586

Note that this null model actually returns the same result as when you use a one-sample t-test on the reaction times; compare the mean and the t-value:

t.test(d$RT)

    One Sample t-test

data:  d$RT
t = 221.37, df = 7751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 655.6111 667.3259
sample estimates:
mean of x
 661.4685 

And this is how we’d get the confidence intervals that t.test provides from the regression model:

Confint(m_00)
            Estimate    2.5 %   97.5 %
(Intercept) 661.4685 655.6111 667.3259

And the manual computation would be this:

# lower bound:
coef(m_00)[1] +              # the intercept plus
   summary(m_00)$coef[1,2] * # the intercept's se
   qt(0.025, df=7751)        # the t-value for 2.5% at that df
(Intercept)
   655.6111 
# upper bound:
coef(m_00)[1] +              # the intercept plus
   summary(m_00)$coef[1,2] * # the intercept's se
   qt(0.975, df=7751)        # the t-value for 2.5% at that df
(Intercept)
   667.3259 

Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …

3 A numeric predictor

Does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?

3.1 Exploration & preparation

Some exploration of the response variable:

hist(d$RT); hist(d$RT, breaks="FD")

Very long tail on the right … Let’s see where ‘the middle 90%’ of the data are, which might be useful as a good plotting range:

quantile(d$RT, probs=seq(0, 1, 0.05), na.rm=TRUE)
     0%      5%     10%     15%     20%     25%     30%     35%     40%     45%
 271.00  417.00  446.00  469.65  487.00  505.00  522.00  538.00  556.00  576.00
    50%     55%     60%     65%     70%     75%     80%     85%     90%     95%
 595.00  614.00  640.00  665.00  695.00  732.00  777.00  834.00  929.00 1137.45
   100%
4130.00 

Ok, so the middle 90% of the data are in [417, 1137.45], which means that some values like that might make for good y-axis limits (e.g. for some plots). Now the predictor:

hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")

# the predictor(s) w/ the response
plot(
   main="RT in ms as a function of frequency per million words",
   pch=16, col="#00000030",
   xlab="Frequency pmw", xlim=c(  0, 1200), x=d$FREQPMW,
   ylab="RT in ms"     , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$FREQPMW), col="red", lwd=2)

3.2 Modeling & numerical interpretation

How would we have dealt with this in Ling 104? We would have computed a Pearson product-moment correlation coefficient r (or Spearman’s ρ, either with cor.test) and maybe a simple linear model (with lm). Thus, we might have done something like this (for Pearson’s r) …

cor.test(d$RT, d$FREQPMW)

    Pearson's product-moment correlation

data:  d$RT and d$FREQPMW
t = -5.8629, df = 7750, p-value = 4.734e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.08858155 -0.04425515
sample estimates:
        cor
-0.06645113 

… or this for the simplest possible linear regression model:

summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   FREQPMW,               # & the predictor FREQPMW
   data=d,                # those variables are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-396.3 -156.1  -68.9   72.4 3468.1

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 673.83892    3.65267 184.479  < 2e-16 ***
FREQPMW      -0.14926    0.02546  -5.863 4.73e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 262.5 on 7750 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.004416,  Adjusted R-squared:  0.004287
F-statistic: 34.37 on 1 and 7750 DF,  p-value: 4.734e-09

Note the equivalence of the classical Pearson’s r and the linear model we fit:

  • the t-statistic from Pearson’s r (-5.8629236) squared is m_01’s F-value (34.3738736);
  • the t-statistic from Pearson’s r (-5.8629236) is also the t-statistic of FREQPMW (-5.8629236);
  • the df of Pearson’s r (7750) is also the df2 of m_01 (7750);
  • the value of Pearson’s r (-0.0664511) squared is m_01’s Multiple R-squared (0.0044158).

But that correlation/model can’t be ‘right’, given the descriptive/exploratory plots. Thus, we ask, does the reaction time to a word (in ms) vary as a function of a logged version of the frequency of that word? For that we use the already provided variable ZIPFFREQ so let’s explore the new predictor:

# the new predictor(s) on its/their own
hist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")

# the predictor(s) w/ the response
plot(
   main="RT in ms as a function of frequency per million words",
   pch=16, col="#00000030",
   xlab="Zipf frequency", xlim=c(  3,    6), x=d$ZIPFFREQ,
   ylab="RT in ms"      , ylim=c(250, 4250), y=d$RT); grid()
abline(lm(d$RT ~ d$ZIPFFREQ), col="red", lwd=2)

Let’s fit our new/second regression model:

summary(m_02 <- lm(       # make/summarize the linear model m_02:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   ZIPFFREQ,              # & the predictor ZIPFFREQ (which involves logging)
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT ~ 1 + ZIPFFREQ, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-411.4 -153.1  -64.6   71.5 3486.9

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   949.36      25.11   37.80   <2e-16 ***
ZIPFFREQ      -62.46       5.41  -11.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 260.9 on 7750 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.01691,   Adjusted R-squared:  0.01678
F-statistic: 133.3 on 1 and 7750 DF,  p-value: < 2.2e-16

What are the confidence intervals of the coefficients?

Confint(m_02)
             Estimate     2.5 %    97.5 %
(Intercept) 949.36052 900.13179 998.58925
ZIPFFREQ    -62.45655 -73.06186 -51.85124

And the manual computation would be this:

# lower bound for the intercept:
coef(m_02)[1] +              # the intercept plus
   summary(m_02)$coef[1,2] * # the intercept's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
(Intercept)
   900.1318 
# upper bound for the intercept:
coef(m_02)[1] +              # the intercept plus
   summary(m_02)$coef[1,2] * # the intercept's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
(Intercept)
   998.5892 
# lower bound for the slope:
coef(m_02)[2] +              # the slope plus
   summary(m_02)$coef[2,2] * # the slope's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
 ZIPFFREQ
-73.06186 
# upper bound for the slope:
coef(m_02)[2] +              # the slope plus
   summary(m_02)$coef[2,2] * # the slope's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
 ZIPFFREQ
-51.85124 

What is the deviance of our model with its one predictor?

deviance(m_02)
[1] 527402124

Wow, that doesn’t seem to have reduced the deviance much from that of the null model, which was 5.3647159^{8}. How much is the reduction in %?

"/"(
   deviance(m_00)-deviance(m_02),
   deviance(m_00))
[1] 0.01690576

Does that number look familiar?

Let’s now also add the predictions of the model to the original data frame (usually a good (house-keeping) habit to adopt):

d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d) # check the result
  CASE   RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE     PRED
1    1  748      5  spanish heritage correct      19 4.278754    b 671.0029
2    2  408      5  spanish heritage correct      78 4.892095    b 662.1964
3    3  483      6  spanish heritage correct     233 5.367356    b 639.0605
4    4 1266      4  spanish heritage correct     133 5.123852    b 653.9869
5    5  636      6  spanish heritage correct      14 4.146128    b 671.7492
6    6  395      6  spanish heritage correct      67 4.826075    b 663.8383

3.3 Visual interpretation

Let’s visualize the nature of the effect of ZIPFFREQ based on the predictions from effects:

(zpf_d <- data.frame( # make zpf_d a data frame of
   zpf <- effect(     # the effect
      "ZIPFFREQ",     # of ZIPFFREQ
      m_02)))         # in m_02
  ZIPFFREQ      fit       se    lower    upper
1      3.0 761.9909 9.197751 743.9608 780.0209
2      3.8 712.0256 5.287484 701.6607 722.3905
3      4.5 668.3060 3.021493 662.3831 674.2290
4      5.3 618.3408 4.768120 608.9940 627.6876
5      6.1 568.3756 8.591008 551.5349 585.2163
plot(zpf,                        # plot the effect of ZIPFFREQ in m_02
   xlab="Zipf frequency",        # w/ this x-axis label
   ylab="RT in ms",              # w/ this y-axis label
   ylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                    # & w/ a grid

plot(zpf,                 # plot the effect of ZIPFFREQ
   xlab="Zipf frequency", # w/ this x-axis label
   ylab="RT in ms (middle 90%)", # w/ this y-axis label
   ylim=c(417, 1138),     # w/ these y-axis limits (middle 90%)
   grid=TRUE)             # & w/ a grid

And here a version with everything: observed values, predicted values, and the confidence band of the predictions/regression line:

par(mfrow=c(1,2))
plot(main="RT in ms as a function of Zipf frequency",
   pch=16, col="#00000030",
   xlab="Zipf frequency", xlim=c(  3,    6), x=d$ZIPFFREQ,
   ylab="RT in ms"      , ylim=c(250, 4250), y=d$RT); grid()
# add the regression line:
abline(m_02, col="red", lwd=2)
# add the confidence band:
polygon(
   c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),
   c(zpf_d$lower, rev(zpf_d$upper)),
   col="#FF000030", border=NA)
# indicate the middle 90%:
abline(h=c(417, 1138), lty=3, col="blue")
plot(main="RT in ms as a function of Zipf frequency",
   frame.plot=FALSE, pch=16, col="#00000030",
   xlab="Zipf frequency"       , xlim=c(  3,    6), x=d$ZIPFFREQ,
   ylab="RT in ms (middle 90%)", ylim=c(417, 1138), y=d$RT)
grid(); box(col="blue", lty=3)
# add the regression line:
abline(m_02, col="red", lwd=2)
# add the confidence band:
polygon(col="#FF000030", border=NA,
   c(zpf_d$ZIPFFREQ, rev(zpf_d$ZIPFFREQ)),
   c(zpf_d$lower, rev(zpf_d$upper)))
par(mfrow=c(1,1))

3.4 Write-up

To determine whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words), a linear model was fit with RT as the response variable and FREQPMW as a predictor. Exploration of all variables involved indicated the skewed distribution of FREQPMW, which is why a log-transformed version of this predictor was used in its place, ZIPFFREQ. The model was highly significant (F=133.3, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.017, adj. R2=0.017). The coefficients indicate that ZIPFFREQ is negatively correlated with RT, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail.

Estimate 95%-CI se t p2-tailed
Intercept (ZIPFFREQ=0) 949.36 [900.13, 998.59] 25.11 37.8 <0.001
ZIPFFREQ0→1 -62.46 [-73.06, -51.06] 5.41 -11.54 <0.001
# housekeeping: remove the predictions
str(d <- d[,1:9])
'data.frame':   8000 obs. of  9 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...

4 A binary predictor

Does the reaction time to a word (in ms) vary as a function of which language that word was presented in (English vs. Spanish)?

4.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")

table(d$LANGUAGE)

english spanish
   4023    3977 
# the predictor(s) w/ the response
qwe <- boxplot(main="RT in ms as a function of stimulus language",
   d$RT ~ d$LANGUAGE, notch=TRUE,
   xlab="Stimulus language",
   ylab="RT in ms", ylim=c(250, 4250)); grid()

There are many outliers, let’s plot this again without them:

qwe$stats
     [,1] [,2]
[1,]  271  303
[2,]  477  548
[3,]  548  649
[4,]  658  810
[5,]  929 1202
# the predictor(s) w/ the response
boxplot(main="RT in ms as a function of stimulus language",
   d$RT ~ d$LANGUAGE, notch=TRUE, outline=FALSE,
   xlab="Stimulus language",
   ylab="RT in ms (no 'outliers')"); grid()

4.2 Modeling & numerical interpretation

How would we have dealt with this in Ling 104?

Thus, we might have done something like this …

t.test(d$RT ~ d$LANGUAGE, var.equal=TRUE) # the default t-test (Student)

    Two Sample t-test

data:  d$RT by d$LANGUAGE
t = -23.661, df = 7750, p-value < 2.2e-16
alternative hypothesis: true difference in means between group english and group spanish is not equal to 0
95 percent confidence interval:
 -147.9268 -125.2911
sample estimates:
mean in group english mean in group spanish
             594.9439              731.5528 

This t-test is actually also just a very simple linear regression model:

summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   LANGUAGE,              # & the predictor LANGUAGE
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT ~ 1 + LANGUAGE, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-428.6 -146.6  -58.9   69.1 3398.4

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      594.944      4.029  147.66   <2e-16 ***
LANGUAGEspanish  136.609      5.774   23.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 254.1 on 7750 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.06737,   Adjusted R-squared:  0.06725
F-statistic: 559.8 on 1 and 7750 DF,  p-value: < 2.2e-16

Note the equivalence of the classical t-test and the linear model we fit (with the exception of the direction of the differences):

  • the t-statistic from the t-test (-23.6609065) squared is m_01’s F-value (559.838495);
  • the t-statistic from the t-test (-23.6609065) is also the (negative of the) t-statistic of LANGUAGEspa (23.6609065);
  • the df of the t-test (7750) is also the df2 of m_01 (7750);
  • the first mean reported by the t-test (mean in group eng, of 594.9439276) is the intercept of m_01 (of 594.9439276);
  • the second mean reported by the t-test (mean in group spa, of 731.5528477) is the sum of
    • the intercept of m_01 (of 594.9439276) and
    • the mean difference/slope of m_01 (of 136.6089201);
  • the confidence interval of the t-test (-147.9267615, -125.2910787) is the (negative) coefficient for LANGUAGEspa (of 136.6089201) plus/minus the product of
    • its standard error (of 5.7736131) and
    • the t-value that cuts off 2.5% of the area under the curve for df=7750 (qt(p=0.025, df=7750), i.e. -1.9602701).

The CI computation for the linear model is shown here, first manually again and then the easy way with Confint:

# lower bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
(Intercept)
    587.046 
# upper bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
(Intercept)
   602.8419 
# lower bound for the difference/slope:
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.025, df=7750)        # the t-value for 2.5% at that df
LANGUAGEspanish
       125.2911 
# upper bound for the difference/slope:
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.975, df=7750)        # the t-value for 2.5% at that df
LANGUAGEspanish
       147.9268 
Confint(m_01)
                Estimate    2.5 %   97.5 %
(Intercept)     594.9439 587.0460 602.8419
LANGUAGEspanish 136.6089 125.2911 147.9268

Back to the model: What is the deviance of our model with its one predictor?

deviance(m_01)
[1] 500329194

This seems to have reduced the deviance much from that of the null model, which was 5.3647159^{8} – again, how much is the reduction in %?

"/"(
   deviance(m_00)-deviance(m_01),
   deviance(m_00))
[1] 0.06737056

Which I am again hoping is familiar by now …

We again add the predictions of the model to the original data frame:

d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d, 3) # check ...
  CASE  RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE     PRED
1    1 748      5  spanish heritage correct      19 4.278754    b 731.5528
2    2 408      5  spanish heritage correct      78 4.892095    b 731.5528
3    3 483      6  spanish heritage correct     233 5.367356    b 731.5528
tail(d, 3) # the result
     CASE  RT LENGTH LANGUAGE   GROUP CORRECT FREQPMW ZIPFFREQ SITE     PRED
7998 8608 947      5  english english correct      80 4.903090    a 594.9439
7999 8609 660      4  english english correct     746 5.872739    a 594.9439
8000 8610 437      5  english english correct      54 4.732394    c 594.9439

4.3 Visual interpretation

Let’s visualize the nature of the effect of LANGUAGE based on the predictions from effects:

(lan_d <- data.frame( # make lan_d a data frame of
   lan <- effect(     # the effect
      "LANGUAGE",     # of LANGUAGE
      m_01)))         # in m_01
  LANGUAGE      fit       se    lower    upper
1  english 594.9439 4.029019 587.0460 602.8419
2  spanish 731.5528 4.135410 723.4463 739.6594
plot(lan,                        # plot the effect of LANGUAGE in m_01
   xlab="Language",              # w/ this x-axis label
   ylab="RT in ms",              # w/ this y-axis label
   ylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                    # & w/ a grid

plot(lan,                        # plot the effect of LANGUAGE
   xlab="Language",              # w/ this x-axis label
   ylab="RT in ms (middle 90%)", # w/ this y-axis label
   ylim=c(417, 1138),            # w/ these y-axis limits (middle 90%)
   grid=TRUE)                    # & w/ a grid

And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:

par(mfrow=c(1,2))
stripchart(main="RT ~ LANGUAGE", # stripchart w/ main heading
   pch=16, col="#00000020",      # filled semi-transparent grey circles
   d$RT ~ d$LANGUAGE,            # data: RT ~ LANGUAGE
   xlab="Language", ylab="RT in ms", # axis labels
   vertical=TRUE,                # make the chart have a vertical y-axis
   method="jitter")              # jitter observations
# add horizontal grid lines at deciles & indicate the middle 90%:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
abline(h=c(417, 1138), lty=3, col="blue")
points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(lan_d$fit)) { # their confidence intervals
   arrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")
}

stripchart(main="RT ~ LANGUAGE", # stripchart w/ main heading
   pch=16, col="#00000020",      # filled semi-transparent grey circles
   d$RT ~ d$LANGUAGE,            # data: RT ~ LANGUAGE
   xlab="Language", ylab="RT in ms (middle 90%)", # axis labels
   ylim=c(417, 1138),            # y-axis limits: middle 90%
   vertical=TRUE,                # make the chart have a vertical y-axis
   method="jitter",              # jitter observations
   frame.plot=FALSE)             # suppress box
box(col="blue", lty=3)
# add horizontal grid lines at deciles:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
points(seq(lan_d$fit), lan_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(lan_d$fit)) { # their confidence intervals
   arrows(i, lan_d$lower[i], i, lan_d$upper[i], code=3, angle=90, col="red")
}
par(mfrow=c(1,1))

4.4 Write-up

To determine whether the reaction time to a word (in ms) varies as a function of the language the word was presented in (english vs. spanish), a linear model was fit with RT as the response variable and LANGUAGE as a predictor. The model was highly significant (F=559.8, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.067, adj. R2=0.067). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail.

Estimate 95%-CI se t p2-tailed
Intercept (LANGUAGE=english) 594.94 [587.05, 602.84] 4.03 147.66 <0.001
LANGUAGEenglishspanish 136.61 [125.29, 147.93] 5.77 23.66 <0.001
# housekeeping: remove the predictions
str(d <- d[,1:9])
'data.frame':   8000 obs. of  9 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...

5 A categorical predictor

Does the reaction time to a word (in ms) vary as a function of in which of three experimental sites the data were collected (a vs. b vs. c)?

5.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
hist(d$RT); hist(d$RT, breaks="FD")

table(d$SITE)

   a    b    c
2403 2815 2782 
# the predictor(s) w/ the response
par(mfrow=c(1, 2)) # make the plotting window have 1 rows & 2 columns
qwe <- boxplot(main="RT in ms ~  experimental site",
   d$RT ~ d$SITE, notch=TRUE,
   xlab="Experimental site",
   ylab="RT in ms"         , ylim=c(250, 4250)); grid()
boxplot(main="RT in ms ~ experimental site",
   d$RT ~ d$SITE, notch=TRUE, outline=FALSE,
   xlab="Experimental site",
   ylab="RT in ms (no 'outliers')"); grid()
par(mfrow=c(1, 1)) # reset to default

It is noteworthy that site c differs considerably from the others in the outlier distribution; this would definitely require some attention!

5.2 Modeling & numerical interpretation

Let’s fit our regression model:

summary(m_01 <- lm(       # make/summarize the linear model m_01:
   RT ~ 1 +               # RT ~ an overall intercept (1)
   SITE,                  # & the predictor SITE
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data (good habit)

Call:
lm(formula = RT ~ 1 + SITE, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-346.9 -158.5  -64.8   76.2 3416.1

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  603.449      5.292 114.040   <2e-16 ***
SITEb         60.334      7.204   8.375   <2e-16 ***
SITEc        110.467      7.386  14.956   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 259.4 on 7749 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.0281,    Adjusted R-squared:  0.02785
F-statistic:   112 on 2 and 7749 DF,  p-value: < 2.2e-16

The CI computation for the linear model is shown here, first manually again and then the easy way with Confint:

# lower bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
(Intercept)
   593.0765 
# upper bound for the intercept:
coef(m_01)[1] +              # the intercept plus
   summary(m_01)$coef[1,2] * # the intercept's se
   qt(0.975, df=7749)        # the t-value for 2.5% at that df
(Intercept)
   613.8223 
# lower bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
   SITEb
46.21166 
# upper bound for the difference/slope for SITE b (rel. to a):
coef(m_01)[2] +              # the difference/slope plus
   summary(m_01)$coef[2,2] * # the difference/slope's se
   qt(0.975, df=7749)        # the t-value for 2.5% at that df
   SITEb
74.45678 
# lower bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +              # the difference/slope plus
   summary(m_01)$coef[3,2] * # the difference/slope's se
   qt(0.025, df=7749)        # the t-value for 2.5% at that df
   SITEc
95.98823 
# upper bound for the difference/slope for SITE c (rel. to a):
coef(m_01)[3] +              # the difference/slope plus
   summary(m_01)$coef[3,2] * # the difference/slope's se
   qt(0.975, df=7749)        # the t-value for 2.5% at that df
   SITEc
124.9456 
Confint(m_01)
             Estimate     2.5 %    97.5 %
(Intercept) 603.44944 593.07653 613.82235
SITEb        60.33422  46.21166  74.45678
SITEc       110.46690  95.98823 124.94557

What is the deviance of our model with its one predictor (but 2 contrasts from the intercept)?

deviance(m_01)
[1] 521397026

Again pretty bad – how much is the reduction in % relative to the null model?

"/"(
   deviance(m_00)-deviance(m_01),
   deviance(m_00))
[1] 0.02809946

As usual, we add the predictions of the model to the original data frame:

d$PRED <-   # make d$PRED the
   predict( # predictions for all cases
   m_01)    # based on m_01
head(d, 3) # check ...
  CASE  RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE     PRED
1    1 748      5  spanish heritage correct      19 4.278754    b 663.7837
2    2 408      5  spanish heritage correct      78 4.892095    b 663.7837
3    3 483      6  spanish heritage correct     233 5.367356    b 663.7837
tail(d, 3) # the result
     CASE  RT LENGTH LANGUAGE   GROUP CORRECT FREQPMW ZIPFFREQ SITE     PRED
7998 8608 947      5  english english correct      80 4.903090    a 603.4494
7999 8609 660      4  english english correct     746 5.872739    a 603.4494
8000 8610 437      5  english english correct      54 4.732394    c 713.9163

5.3 Visual interpretation

Let’s visualize the nature of the effect of SITE based on the predictions from effects:

(sit_d <- data.frame( # make sit_d a data frame of
   sit <- effect   (  # the effect
      "SITE",         # of SITE
      m_01)))         # in m_01
  SITE      fit       se    lower    upper
1    a 603.4494 5.291570 593.0765 613.8223
2    b 663.7837 4.889025 654.1998 673.3675
3    c 713.9163 5.152976 703.8151 724.0176
plot(sit,                        # plot the effect of SITE in m_01
   xlab="Site",                  # w/ this x-axis label
   ylab="RT in ms",              # w/ this y-axis label
   ylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limits
   grid=TRUE)                    # & w/ a grid

plot(sit,                        # plot the effect of SITE
   xlab="Site",                  # w/ this x-axis label
   ylab="RT in ms (middle 90%)", # w/ this y-axis label
   ylim=c(417, 1138),            # w/ these y-axis limits (middle 90%)
   grid=TRUE)                    # & w/ a grid

And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:

par(mfrow=c(1,2))
stripchart(main="RT ~ SITE", # stripchart w/ main heading
   pch=16, col="#00000020",  # filled semi-transparent grey circles
   d$RT ~ d$SITE,            # data: RT ~ SITE
   xlab="RT in ms", ylab="Site", # axis labels
   vertical=TRUE,            # make the chart have a vertical y-axis
   method="jitter")          # jitter observations
# add horizontal grid lines at deciles & indicate the middle 90%:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
abline(h=c(417, 1138), lty=3, col="blue")
points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(sit_d$fit)) { # their confidence intervals
   arrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")
}
stripchart(main="RT ~ SITE", # stripchart w/ main heading
   pch=16, col="#00000020",  # filled semi-transparent grey circles
   d$RT ~ d$SITE,            # data: RT ~ SITE
   xlab="RT in ms", ylab="Site", # axis labels
   ylim=c(417, 1138),            # y-axis limits: middle 90%
   vertical=TRUE,            # make the chart have a vertical y-axis
   method="jitter",          # jitter observations
   frame.plot=FALSE)             # suppress box
box(col="blue", lty=3)
# add horizontal grid lines at deciles & indicate the middle 90%:
abline(h=quantile(d$RT, probs=seq(0, 1, 0.1), na.rm=TRUE), lty=3, col="#00000030")
points(seq(sit_d$fit), sit_d$fit, pch="X", col="red") # add predicted means &
for (i in seq(sit_d$fit)) { # their confidence intervals
   arrows(i, sit_d$lower[i], i, sit_d$upper[i], code=3, angle=90, col="red")
}
par(mfrow=c(1,1))

Excursus: This is an interesting case because only the very last plot shows something that we haven’t noticed before, namely the fact that both sites a and c have a surprising gap of RT values around 600 ms (specifically, no values in the interval [590, 602]); we can see this much more clearly in what is one of my favorite exploratory plots anyway, an ecdf plot:

plot(main="RT in ms ~ experimental site", type="n",
   xlab="RT in ms"             , xlim=c(7.75, 12.25), x=0,
   ylab="Cumulative proportion", ylim=c(0   ,  1   ), y=0); grid()
for (i in 1:3) {            # for each of the 3 sites
   lines(pch=".",           # plot lines w/ this point character
      ecdf(log2(d$RT[d$SITE==levels(d$SITE)[i]])), # namely ecdf lines
         col=rainbow(3)[i]) # using these colors
}
legend(title="Site",      # add a legend with the title "Site"
   x=10.5, xjust=0.5,     # at these x-coordinates (centered)
   y= 0.5, yjust=0.5,     # at these y-coordinates (centered)
   legend=levels(d$SITE), # these are the labels, show them ...
   fill=rainbow(3),       # w/ these colors
   ncol=3, bty="n")       # organize the legend in 3 columns, no box

5.4 Excursus: model comparison

In cases where one has multiple levels of a categorical predictor, sometimes the question arises as to whether (all) levels of that predictor are in fact significantly different from each other. In this particular case, the effects plot for SITE very clearly suggests that they are (because the confidence intervals of all predicted means do not overlap), but let’s discuss briefly how this can be tested with a method called model comparison: If you have two models that were fit to the same data set and where one model is a sub-model of the other model – typically such that they differ only by one predictor/term – then you can use the function anova to compare the two models to see whether the more complex model can justify its higher degree of complexity by being significantly better than the less complex model. In this case, 2 of the 3 possible comparisons of SITE are already in the summary table:

summary(m_01)$coefficients
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 603.44944   5.291570 114.039769 0.000000e+00
SITEb        60.33422   7.204393   8.374643 6.514685e-17
SITEc       110.46690   7.386060  14.956134 7.021601e-50

From this, we know that SITE: a is significantly different from SITE: b and SITE: c, but we don’t know whether the 110.46690-60.33422=50.13268 difference between SITE: b and SITE: c is significant as well. To find that out, we can compare m_01 to a model in which SITE: b and SITE: c are not distinguished. For that, we first, we create a version of SITE that conflates levels as desired:

d$SITE_confl <-d$SITE # create a copy of SITE
levels(d$SITE_confl) <- # overwrite the copy's levels such that
   c("a",               # "a" remains "a"
     "b/c", "b/c")      # "b" & "c" become "b/c"
table(SITE=d$SITE, SITE_confl=d$SITE_confl) # check you did it right
    SITE_confl
SITE    a  b/c
   a 2403    0
   b    0 2815
   c    0 2782
head(d)
  CASE   RT LENGTH LANGUAGE    GROUP CORRECT FREQPMW ZIPFFREQ SITE     PRED
1    1  748      5  spanish heritage correct      19 4.278754    b 663.7837
2    2  408      5  spanish heritage correct      78 4.892095    b 663.7837
3    3  483      6  spanish heritage correct     233 5.367356    b 663.7837
4    4 1266      4  spanish heritage correct     133 5.123852    b 663.7837
5    5  636      6  spanish heritage correct      14 4.146128    b 663.7837
6    6  395      6  spanish heritage correct      67 4.826075    b 663.7837
  SITE_confl
1        b/c
2        b/c
3        b/c
4        b/c
5        b/c
6        b/c

We already have the model with SITE as the only predictor so we only need to fit a new model with SITE_confl as the only predictor:

summary(m_01_siteconfl <- lm( # make/summarize the linear model m_01_siteconfl:
   RT ~ 1 + SITE_confl,       # RT ~ an overall intercept (1) & SITE_confl
   data=d, na.action=na.exclude)) # the other arguments

Call:
lm(formula = RT ~ 1 + SITE_confl, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-332.4 -155.5  -67.4   77.5 3442.5

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    603.449      5.308  113.68   <2e-16 ***
SITE_conflb/c   84.084      6.390   13.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 260.2 on 7750 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.02185,   Adjusted R-squared:  0.02173
F-statistic: 173.1 on 1 and 7750 DF,  p-value: < 2.2e-16

… and then we do a comparison with anova:

anova(m_01,        # compare the model w/ the 3-level predictor SITE
   m_01_siteconfl, # & the model w/ the 2-level predictor SITE_confl
   test="F")       # using an F-test
Analysis of Variance Table

Model 1: RT ~ 1 + SITE
Model 2: RT ~ 1 + SITE_confl
  Res.Df       RSS Df Sum of Sq      F    Pr(>F)
1   7749 521397026
2   7750 524748642 -1  -3351616 49.812 1.839e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the two models is hugely significant: The finer resolution of SITE compared to SITE_confl makes the model significantly better (p<0.00001). Thus, we now know not only the size of the 3rd comparison (SITE: b vs. c, which was clear from summary(m_01)), but also its p-value.

Another way of getting such results is available from emmeans::emmeans, as you saw in the book. Here,

  • lines 2 and 3 of the model’s summary table correspond to lines 1 and 2 of the emmeans output;
  • line 3 of the emmeans output corresponds to the result we obtained with the model conflation approach above: the F-value of the model conflation approach is the squared t-value of the emmeans approach:
summary(m_01)$coefficients
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 603.44944   5.291570 114.039769 0.000000e+00
SITEb        60.33422   7.204393   8.374643 6.514685e-17
SITEc       110.46690   7.386060  14.956134 7.021601e-50
pairs(                # compute pairwise comparisons
    emmeans::emmeans( # of means computed
      object=m_01,    # for the model m_01
       ~ SITE),       # namely all contrasts of this predictor
   adjust="none")     # don't adjust for 3 comparisons (careful w/ this)
 contrast estimate   SE   df t.ratio p.value
 a - b       -60.3 7.20 7749  -8.375  <.0001
 a - c      -110.5 7.39 7749 -14.956  <.0001
 b - c       -50.1 7.10 7749  -7.058  <.0001

Finally, a very flexible, powerful, and quite complicated way of getting all sorts of such results here is also available from multcomp::glht. (I only show this here as an appetizer, but a discussion of cool things glht can do is offered in other courses):

c(0, 1, -1) %>% matrix(nrow=1) %>% multcomp::glht(m_01, .) %>% summary

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = RT ~ 1 + SITE, data = d, na.action = na.exclude)

Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)
1 == 0  -50.133      7.103  -7.058 1.84e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

5.5 Write-up

To determine whether the reaction time to a word (in ms) varies as a function of the site where the data were collected (a vs. b vs. c), a linear model was fit with RT as the response variable and SITE as a predictor. The model was highly significant (F=112, df1=2, df2=7749, p<0.0001) but explains very little of the response variable (mult. R2=0.028, adj. R2=0.028). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail (esp. for SITE: c).

Estimate 95%-CI se t p2-tailed
Intercept (SITE=a) 603.45 [593.08, 613.82] 5.29 114.04 <0.001
SITEab 60.33 [46.21, 74.46] 7.2 8.38 <0.001
SITEac 110.47 [95.99, 124.95] 7.39 14.96 <0.001

A post-hoc comparison of the above model with one in which sites b and c were conflated indicates that sites b and c are also significantly different from each other (difference between means: 50.133, F=49.81/t=7.058, p<0.0001).

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] effects_4.2-2  car_3.1-2      carData_3.0-5  STGmisc_1.0    Rcpp_1.0.12
[6] magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] Matrix_1.6-5      jsonlite_1.8.8    survey_4.4-2      splines_4.3.3
 [5] boot_1.3-30       yaml_2.3.8        fastmap_1.1.1     TH.data_1.1-2
 [9] lattice_0.22-6    coda_0.19-4.1     knitr_1.45        htmlwidgets_1.6.4
[13] MASS_7.3-60.0.1   nloptr_2.0.3      insight_0.19.10   nnet_7.3-19
[17] minqa_1.2.6       DBI_1.2.2         rlang_1.1.3       multcomp_1.4-25
[21] xfun_0.43         estimability_1.5  cli_3.6.2         emmeans_1.10.0
[25] digest_0.6.35     grid_4.3.3        xtable_1.8-4      rstudioapi_0.16.0
[29] mvtnorm_1.2-4     sandwich_3.1-0    lme4_1.1-35.1     nlme_3.1-164
[33] evaluate_0.23     codetools_0.2-19  zoo_1.8-12        mitools_2.4
[37] survival_3.5-8    abind_1.4-5       colorspace_2.1-0  rmarkdown_2.26
[41] tools_4.3.3       htmltools_0.5.8