rm(list=ls(all=TRUE)); invisible(gc())
set.seed(sum(utf8ToInt("Gummibärchen")))
invisible(sapply(c("car", "effects", "emmeans", "magrittr", "multcomp"),
library, character.only=TRUE, logical.return=TRUE,
quietly=TRUE, verbose=FALSE, warn.conflicts=TRUE))
invisible(sapply(dir("_helpers", full.names=TRUE), source))
Predictive modeling @ JLU Giessen: regression
1 Linear modeling
1.1 Introduction
Imagine you have a numeric variable such as the reaction times to words presented on a computer screen; the data we will work with are in _input/rts.csv, you can find information about the variables/columns in _input/rts.r, and our response variable is called RT:
str(d <- read.delim( # the structure of d, the result of loading
file="_input/rts.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
'data.frame': 8000 obs. of 9 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
$ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
$ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
$ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
$ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
$ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
$ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
d$LANGUAGE <- factor(d$LANGUAGE, # for didactic reasons only, I am
levels=levels(d$LANGUAGE)[2:1]) # changing the order of the levels
summary(d)
CASE RT LENGTH LANGUAGE GROUP
Min. : 1 Min. : 271.0 Min. :3.000 spanish:3977 english :3961
1st Qu.:2150 1st Qu.: 505.0 1st Qu.:4.000 english:4023 heritage:4039
Median :4310 Median : 595.0 Median :5.000
Mean :4303 Mean : 661.5 Mean :5.198
3rd Qu.:6450 3rd Qu.: 732.0 3rd Qu.:6.000
Max. :8610 Max. :4130.0 Max. :9.000
NA's :248
CORRECT FREQPMW ZIPFFREQ SITE
correct :7749 Min. : 1.00 Min. :3.000 a:2403
incorrect: 251 1st Qu.: 17.00 1st Qu.:4.230 b:2815
Median : 42.00 Median :4.623 c:2782
Mean : 81.14 Mean :4.591
3rd Qu.: 101.00 3rd Qu.:5.004
Max. :1152.00 Max. :6.061
There are 7752 reaction times in the data. Your job is to choose one number that, if it is used to predict every single reaction time, results in the best overall prediction for all the data. What would you pick?
Let’s see how good each of these is, where we determine goodness on the basis of the sum of the squared differences – why? Because (i) that makes all differences positive and (ii) that penalizes bad predictions more:
possible_picks <- c( # the candidate guesses (recoverable from the output below)
   300, 600, 700, 900, 1400, 2000)
(possible_squared_deviations <- setNames(c(
(d$RT - possible_picks[1])^2 %>% sum(na.rm=TRUE),
(d$RT - possible_picks[2])^2 %>% sum(na.rm=TRUE),
(d$RT - possible_picks[3])^2 %>% sum(na.rm=TRUE),
(d$RT - possible_picks[4])^2 %>% sum(na.rm=TRUE),
(d$RT - possible_picks[5])^2 %>% sum(na.rm=TRUE),
(d$RT - possible_picks[6])^2 %>% sum(na.rm=TRUE)),
possible_picks))
# sapply(possible_picks, \(af) sum((d$RT-af)^2, na.rm=TRUE))
300 600 700 900 1400 2000
1549343984 565761584 547980784 977539184 4764635184 14425470384
Here are the corresponding results for my guess …
… which is better than all other results above:
300 600 700 900 1400 2000
TRUE TRUE TRUE TRUE TRUE TRUE
Why, what did I guess? What everyone should have guessed: a measure of central tendency, the statistic you would provide for a variable of interest, a response variable, if you are only allowed to provide a single value. How can we compute this here?
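The computation itself is hidden here; presumably it was simply this:

mean(d$RT, na.rm=TRUE) # the arithmetic mean of RT, ignoring missing values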
And then, if we evaluate how well this does along the lines from above, we get the result we already saw:
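A sketch of that evaluation, along the lines of the sums above:

sum((d$RT - # the sum of the squared differences
   mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE) # between RT & its mean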
A different way to have R answer that question is with a linear model. But since right now we’re only looking at the response variable RT – we have no predictor(s)! – this model is what is called the null model:
summary(m_00 <- lm( # make/summarize the linear model m_00:
RT ~ 1, # RT ~ an overall intercept (1) & no predictor
data=d, # those variables are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-390.5 -156.5 -66.5 70.5 3468.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 661.469 2.988 221.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 263.1 on 7751 degrees of freedom
(248 observations deleted due to missingness)
This is the same as the sum of the squared residuals – what is that sum?
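Presumably computed like this, in two equivalent ways (hence the same number twice below):

sum((d$RT - mean(d$RT, na.rm=TRUE))^2, na.rm=TRUE) # squared deviations from the mean ...
sum(residuals(m_00)^2, na.rm=TRUE) # ... = the sum of m_00's squared residuals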
[1] 536471586
[1] 536471586
Another way to get this more easily is this:
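A sketch, assuming the standard extractor function was meant:

deviance(m_00) # for a linear model, the residual sum of squares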
How/where does this null model guess/predict the overall mean?
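One way to see it, as a minimal sketch: the null model’s only coefficient is the intercept, and that intercept is the mean:

coef(m_00) # the intercept = the overall mean of RT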
What do the other columns mean? The null model is the same as if you use a one-sample t-test to test whether the mean of reaction times is 0:
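A sketch of the corresponding computations (the first line reproduces the intercept’s standard error from the summary above):

sd(d$RT, na.rm=TRUE) / # the standard error of the mean:
   sqrt(sum(!is.na(d$RT))) # sd / square root of n
t.test(d$RT) # the one-sample t-test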
[1] 2.988048
One Sample t-test
data: d$RT
t = 221.37, df = 7751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
655.6111 667.3259
sample estimates:
mean of x
661.4685
Since our above deviance value is the overall amount of variability of the response variable and, at the same time, an expression of how bad a model with no predictors is at making predictions, we are now of course hoping that our predictors can reduce that number …
1.2 A binary predictor
1.2.1 Modeling & numerical interpretation
Let us now apply this and linear modeling to a scenario where we have a binary predictor. Imagine we’re asking, does the reaction time to a word (in ms) vary as a function of which language that word was presented in (Spanish vs. English)?
Given what we said so far, what is going to be the best strategy to make the best guesses, i.e. reduce your deviance? You should guess the mean for each level of the predictor:
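Presumably the two group means were first computed separately, e.g. like this:

mean(d$RT[d$LANGUAGE=="spanish"], na.rm=TRUE) # the mean RT for spanish words
mean(d$RT[d$LANGUAGE=="english"], na.rm=TRUE) # the mean RT for english words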
[1] 731.5528
[1] 594.9439
How do we do this better?
tapply( # apply to
d$RT, # this variable
d$LANGUAGE, # a grouping by this one
mean, na.rm=TRUE) # return means for each group
spanish english
731.5528 594.9439
Note, there are two ways to represent this:
- we get a mean for each group (like above);
- we get one mean (e.g., the one for LANGUAGE: spanish) and the info how, from that mean, we can compute the other (i.e. spanish→english).

What R does by default is option 2:
summary(m_01 <- lm( # make/summarize the linear model m_01:
RT ~ 1 + # RT ~ an overall intercept (1)
LANGUAGE, # & the predictor LANGUAGE
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1 + LANGUAGE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-428.6 -146.6 -58.9 69.1 3398.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 731.553 4.135 176.90 <2e-16 ***
LANGUAGEenglish -136.609 5.774 -23.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 254.1 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.06737, Adjusted R-squared: 0.06725
F-statistic: 559.8 on 1 and 7750 DF, p-value: < 2.2e-16
First, is the predictor LANGUAGE significant? This is easy here because
- the model has just this one predictor so the p-value of the whole model is the p-value of its only predictor;
- the predictor has 1 df (because it is binary), which means that the p-value in the coefficients table for LANGUAGEenglish is also the p-value of the predictor.
This predictor is actually like using a binary logical (or numeric 0-1 coded) vector as a predictor that answers the question ‘is LANGUAGE english?’:
d$is_LANGUAGE_english <- d$LANGUAGE=="english"
summary(lm( # make/summarize the linear model:
RT ~ 1 + # RT ~ an overall intercept (1)
is_LANGUAGE_english, # & the predictor is_LANGUAGE_english
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT ~ 1 + is_LANGUAGE_english, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-428.6 -146.6 -58.9 69.1 3398.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 731.553 4.135 176.90 <2e-16 ***
is_LANGUAGE_englishTRUE -136.609 5.774 -23.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 254.1 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.06737, Adjusted R-squared: 0.06725
F-statistic: 559.8 on 1 and 7750 DF, p-value: < 2.2e-16
We can see this in particular if we consider two simple but important functions: model.frame and model.matrix. The model frame is all the data (rows and columns) used in the model:
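A sketch of the calls that generate the following output:

head(model.frame(m_01)) # the first 6 rows of the model frame
dim(model.frame(m_01)) # its numbers of rows & columns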
RT LANGUAGE
1 748 spanish
2 408 spanish
3 483 spanish
4 1266 spanish
5 636 spanish
6 395 spanish
[1] 7752 2
The model matrix is how the model internally represents the data and from which you compute predictions:
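A sketch of the calls behind the following head-and-tail comparison:

model.frame(m_01) %>% head(3) # the first 3 rows of the model frame ...
model.matrix(m_01) %>% head(3) # ... & of the model matrix
model.frame(m_01) %>% tail(3) # the last 3 rows of the model frame ...
model.matrix(m_01) %>% tail(3) # ... & of the model matrix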
RT LANGUAGE
1 748 spanish
2 408 spanish
3 483 spanish
(Intercept) LANGUAGEenglish
1 1 0
2 1 0
3 1 0
RT LANGUAGE
7998 947 english
7999 660 english
8000 437 english
(Intercept) LANGUAGEenglish
7998 1 1
7999 1 1
8000 1 1
In model.matrix(m_01), the 2nd column is called LANGUAGEenglish and it is what we computed above as d$is_LANGUAGE_english <- d$LANGUAGE=="english".
So what do the coefficients mean or represent?
The coefficients are used like this: For each case x, you
- multiply the coefficients with row x from the model matrix;
- sum up the result,
and that is the prediction for that case:
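A sketch of this arithmetic with the two coefficients (the model-matrix rows for a spanish and an english case are (1, 0) and (1, 1) respectively, as shown above):

sum(coef(m_01) * c(1, 0)) # prediction for LANGUAGE: spanish
sum(coef(m_01) * c(1, 1)) # prediction for LANGUAGE: english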
[1] 731.5528
[1] 594.9439
That also means
- the intercept is the prediction when the predictor is its first level (spanish);
- the 2nd coefficient is how you get from the intercept to the prediction for when the predictor is its 2nd level (english), and that difference is significant.
But there is an easier way to get the predictions – we use predict – and it’s usually a good idea to add them to the data frame:
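A minimal sketch of that step:

d$PRED <- predict(m_01) # add the predictions to d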
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE
1 1 748 5 spanish heritage correct 19 4.278754 b
2 2 408 5 spanish heritage correct 78 4.892095 b
3 3 483 6 spanish heritage correct 233 5.367356 b
is_LANGUAGE_english PRED
1 FALSE 731.5528
2 FALSE 731.5528
3 FALSE 731.5528
CASE RT LENGTH LANGUAGE GROUP CORRECT FREQPMW ZIPFFREQ SITE
7998 8608 947 5 english english correct 80 4.903090 a
7999 8609 660 4 english english correct 746 5.872739 a
8000 8610 437 5 english english correct 54 4.732394 c
is_LANGUAGE_english PRED
7998 TRUE 594.9439
7999 TRUE 594.9439
8000 TRUE 594.9439
How good is this model? For that we need the deviance again:
How much is this an improvement in % of the original deviance?
What is that? R-squared!
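A sketch of these computations, using deviance() and the null model m_00 from above:

deviance(m_01) # the deviance of m_01
(deviance(m_00) - deviance(m_01)) / # the reduction of the deviance
   deviance(m_00) # as a proportion: multiple R-squared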
Let’s compute the confidence intervals of our coefficients real quick – how?
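Presumably with confint(); a sketch that reproduces the table below:

cbind(Estimate=coef(m_01), confint(m_01)) # the coefficients & their 95%-CIs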
Estimate 2.5 % 97.5 %
(Intercept) 731.5528 723.4463 739.6594
LANGUAGEenglish -136.6089 -147.9268 -125.2911
Here, interpreting the coefficients and making the predictions is very easy, but here is a useful way to do this especially for when models become more complex: We compute the effects predictions of the model:
(lan_d <- data.frame( # make lan_d a data frame of
lan <- effect( # the effect
"LANGUAGE", # of LANGUAGE
m_01))) # in m_01
LANGUAGE fit se lower upper
1 spanish 731.5528 4.135410 723.4463 739.6594
2 english 594.9439 4.029019 587.0460 602.8419
# or
(lan_d <- data.frame( # make lan_d a data frame of
lan <- predictorEffect( # the effect
"LANGUAGE", # of LANGUAGE
m_01))) # in m_01
LANGUAGE fit se lower upper
1 spanish 731.5528 4.135410 723.4463 739.6594
2 english 594.9439 4.029019 587.0460 602.8419
1.2.2 Visual interpretation
Let’s visualize the nature of the effect of LANGUAGE based on the predictions from effects:
plot(lan, # plot the effect of LANGUAGE in m_01
xlab="Language", # w/ this x-axis label
ylab="RT in ms", # w/ this y-axis label
ylim=range(d$RT, na.rm=TRUE), # w/ these y-axis limits
grid=TRUE) # & w/ a grid
plot(lan, # plot the effect of LANGUAGE
xlab="Language", # w/ this x-axis label
ylab="RT in ms (middle 90%)", # w/ this y-axis label
ylim=c(417, 1137.45), # w/ these y-axis limits (middle 90%)
grid=TRUE) # & w/ a grid
And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:
1.2.3 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the language the word was presented in (english vs. spanish), a linear model was fit with RT as the response variable and LANGUAGE as a predictor. The model was highly significant (F=559.8, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.067, adj. R2=0.067). The coefficients indicate that reaction times are longer for Spanish words, as shown in the summary table and the figure shown here. [Show a plot]
| | Estimate | 95%-CI | se | t | p2-tailed |
|---|---|---|---|---|---|
| Intercept (LANGUAGE=spanish) | 731.55 | [723.45, 739.66] | 4.14 | 176.9 | <0.001 |
| LANGUAGEspanish→english | -136.61 | [-147.93, -125.29] | 5.77 | -23.66 | <0.001 |
Some housekeeping:
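The housekeeping code is hidden; given that the output below shows 9 variables again, presumably it dropped the two helper columns:

d$is_LANGUAGE_english <- d$PRED <- NULL # remove the helper columns again
str(d) # check the result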
'data.frame': 8000 obs. of 9 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
$ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
$ LANGUAGE: Factor w/ 2 levels "spanish","english": 1 1 1 1 1 1 1 1 1 1 ...
$ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
$ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
$ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
$ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
Now, why is this model actually crap? That’s the topic of the next section.
1.2.4 Diagnostics 1
Just like most statistical tests, linear models require certain assumptions to be met in order for their results to be trustworthy. The first assumption is concerned with the number of data points we need. For linear models like ours, one rule of thumb (!) is that the number of data points in the model (!) must be greater than the product of the number of values estimated in the model and 15 or, better even, 20.
How many data points do we have? It’s not nrow(d)! – it’s this:
How many values are estimated in the model? The best way to get that number is this:
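A sketch of the two answers (both quantities reappear in the check below):

nrow(model.frame(m_01)) # the number of data points in the model
attr(logLik(m_01), "df") # the number of estimated values (coefficients + error variance)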
So I think we’re just about ok …
c("we have enough data points"=">"(
nrow(model.frame(m_01)),
20 * (m_01 %>% logLik %>% attr("df"))))
we have enough data points
TRUE
Note: one should minimally also check that one has enough data points for all (combinations of) levels of predictors and their interactions!
Another important set of assumptions is that the residuals are
- approximately normally distributed;
- homoscedastic (i.e., here, exhibit a fairly even spread across the range of predicted values).
Let’s look at this here:
par(mfrow=c(2, 2)) # make the plotting window have 2 rows & 2 columns
hist(d$RT, breaks="FD") # a histogram of the response
hist(residuals(m_01), main="Residuals") # a histogram of the residuals
plot(m_01, which=1:2) # plot model-diagnostic plots 1 & 2
par(mfrow=c(1, 1)) # reset to default
Is this good or bad? It’s bad … How might one fix this? With a transformation, for example:
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
hist(log2(d$RT)) # a histogram of ...
hist(log2(d$RT), breaks="FD") # the logged response
par(mfrow=c(1, 1)) # reset to default
Much better, we will work with this from now on:
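The transformation step itself is hidden; judging from the histograms above (and the RT_log values shown later), it is evidently log2:

d$RT_log <- log2(d$RT) # add the logged response to d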
What does the new response look like with the predictor?
# the predictor(s) w/ the response
boxplot(main="RT in ms (logged) ~ stimulus language", varwidth=TRUE,
d$RT_log ~ d$LANGUAGE, notch=TRUE,
xlab="Stimulus language",
ylab="RT in ms (logged)", ylim=c(8, 12)); grid()
Does that improve the diagnostics?
m_02 <- lm( # make the linear model m_02:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
LANGUAGE, # & the predictor LANGUAGE
data=d, # those vars are in d
na.action=na.exclude) # skip cases with NA/missing data (good habit)
par(mfrow=c(2, 2)) # make the plotting window have 2 rows & 2 columns
hist(d$RT_log, breaks="FD") # a histogram of the response
hist(residuals(m_02), main="Residuals") # a histogram of the residuals
plot(m_02, which=1:2) # plot model-diagnostic plots 1 & 2
par(mfrow=c(1, 1)) # reset to default
Yes, so we’ll use this response variable from now on. Also, we will reduce the data set to the complete cases:
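A minimal sketch of that reduction (the reordering of the columns visible below is not reproduced here):

d <- d[complete.cases(d), ] # keep only the complete cases
str(d) # check the result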
'data.frame': 7752 obs. of 10 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ RT_log : num 9.55 8.67 8.92 10.31 9.31 ...
$ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
$ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
$ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
$ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
$ LANGUAGE: Factor w/ 2 levels "spanish","english": 1 1 1 1 1 1 1 1 1 1 ...
$ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
$ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
Anything else we need to do? We need a new null model because our response has changed:
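A sketch of that step, reusing the name m_00 (the name is an assumption):

m_00 <- lm( # make the new null model m_00:
   RT_log ~ 1, # RT_log ~ an overall intercept (1) & no predictor
   data=d, na.action=na.exclude) # the other arguments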
1.3 A categorical predictor
Let us now extend this to a scenario where we have a categorical predictor. Imagine we’re asking, does the (now logged) reaction time to a word (in ms) vary as a function of in which of three experimental sites the data were collected (a vs. b vs. c)?
This time we’re smarter than before and begin with some exploration:
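Presumably starting with the frequencies of the three sites:

table(d$SITE) # how many data points per site?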
a b c
2403 2815 2534
par(mfrow=c(1,2))
hist(d$RT_log, breaks="FD")
boxplot(main="RT in ms (logged) ~ experimental site", varwidth=TRUE,
d$RT_log ~ d$SITE, notch=TRUE,
xlab="Experimental site",
ylab="RT in ms", ylim=c(8, 12)); grid()
par(mfrow=c(1,1))
It is noteworthy that site c differs considerably from the others in the outlier distribution; this would definitely require some attention!
1.3.1 Modeling & numerical interpretation
Given what we said so far, what is going to be the best strategy to make the best guesses, i.e. reduce your deviance? You should guess the mean for each level of the predictor:
tapply( # apply to
d$RT_log, # this variable
d$SITE, # a grouping by this one
mean, na.rm=TRUE) # return means for each group
a b c
9.191055 9.311756 9.357216
Note, there are three ways to represent this:
- we get a mean for each group (like above);
- we get one mean (e.g., the one for SITE: a) and the info how, from that mean, we can compute each of the other two (i.e. a→b and a→c);
- we get one mean (e.g., the one for SITE: a) and the info how we can compute each of the other two successively (i.e. a→b and b→c).

What R does by default is option 2:
summary(m_01 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
SITE, # & the predictor SITE
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + SITE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.10891 -0.30613 -0.08053 0.23524 2.65471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.191055 0.009111 1008.810 <2e-16 ***
SITEb 0.120700 0.012404 9.731 <2e-16 ***
SITEc 0.166161 0.012717 13.066 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4466 on 7749 degrees of freedom
Multiple R-squared: 0.02295, Adjusted R-squared: 0.0227
F-statistic: 91 on 2 and 7749 DF, p-value: < 2.2e-16
Estimate 2.5 % 97.5 %
(Intercept) 9.1910552 9.17319560 9.2089148
SITEb 0.1207003 0.09638472 0.1450159
SITEc 0.1661612 0.14123245 0.1910899
We can see this in particular if we return to model.matrix:
RT_log SITE
10 8.299208 a
1 9.546894 b
7 9.579316 c
(Intercept) SITEb SITEc
10 1 0 0
1 1 1 0
7 1 0 1
In model.matrix(m_01),
- the 2nd column is what we would compute as d$SITE=="b";
- the 3rd column is what we would compute as d$SITE=="c".
But again, is the predictor SITE significant? This is easy here because the model has just this one predictor so the p-value of the whole model is the p-value of its only predictor. However, the predictor has >1 df (because it is categorical with >2 levels), which means that the p-values in the coefficients table do not tell us what the p-value for ‘all of SITE’ is. So, then, what do the coefficients mean or represent?
The coefficients are used like we just said above: For each case x, you
- multiply the coefficients with row x from the model matrix;
- sum up the result,
and that is the prediction for that case:
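A sketch of this arithmetic with the three coefficients (the model-matrix rows for sites a, b, and c are (1, 0, 0), (1, 1, 0), and (1, 0, 1), as shown above):

sum(coef(m_01) * c(1, 0, 0)) # prediction for SITE: a
sum(coef(m_01) * c(1, 1, 0)) # prediction for SITE: b
sum(coef(m_01) * c(1, 0, 1)) # prediction for SITE: c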
[1] 9.191055
[1] 9.311756
[1] 9.357216
In other words,
- the intercept is the prediction when the predictor is its first level (a);
- the 2nd coefficient is how you get from the intercept to the prediction when the predictor is its 2nd level (b), and that difference is significant;
- the 3rd coefficient is how you get from the intercept to the prediction when the predictor is its 3rd level (c), and that difference is significant.
Thus, this is what this model predicts and we again add it to the data frame d:
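Again a minimal sketch:

d$PRED <- predict(m_01) # add the predictions to d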
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
10 10 8.299208 heritage a 6 4.301030 spanish 315 20 correct
1 1 9.546894 heritage b 5 4.278754 spanish 748 19 correct
7 7 9.579316 heritage c 8 4.531479 spanish 765 34 correct
PRED
10 9.191055
1 9.311756
7 9.357216
Is this an acceptable model (in terms of diagnostics)? It’s not that bad …
par(mfrow=c(2, 2)) # make the plotting window have 2 rows & 2 columns
hist(d$RT_log, breaks="FD") # a histogram of the response
hist(residuals(m_01), main="Residuals") # a histogram of the residuals
plot(m_01, which=1:2) # plot model-diagnostic plots 1 & 2
par(mfrow=c(1, 1)) # reset to default
How good is this model? For that we need the deviance again:
How much is this an improvement in % of the original deviance?
What is that again? R-squared.
Again, interpreting the coefficients and making the predictions is very easy, but we again use the effects package for getting all relevant predictions for our predictor:
(sit_d <- data.frame( # make sit_d a data frame of
sit <- effect( # the effect
"SITE", # of SITE
m_01))) # in m_01
SITE fit se lower upper
1 a 9.191055 0.009110786 9.173196 9.208915
2 b 9.311756 0.008417702 9.295255 9.328256
3 c 9.357216 0.008872161 9.339825 9.374608
(sit_d <- data.frame( # make sit_d a data frame of
sit <- predictorEffect( # the effect
"SITE", # of SITE
m_01))) # in m_01
SITE fit se lower upper
1 a 9.191055 0.009110786 9.173196 9.208915
2 b 9.311756 0.008417702 9.295255 9.328256
3 c 9.357216 0.008872161 9.339825 9.374608
1.3.2 Visual interpretation
Let’s visualize the nature of the effect of SITE based on the predictions from effects:
plot(sit, # plot the effect of SITE in m_01
xlab="Site", # w/ this x-axis label
ylab="RT in ms", # w/ this y-axis label
ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
grid=TRUE) # & w/ a grid
And here a version with everything: observed values, predicted values, and the confidence intervals of the predictions:
Excursus: This is interesting because only the very last plot shows something that we haven’t noticed before, namely the fact that both sites a and c have a surprising gap of RT values around 600 ms (specifically, no values in the interval [590, 602]); we can see this much more clearly in what is one of my favorite exploratory plots anyway, an ecdf plot:
This example allows us to pursue a variety of important notions and follow-up questions. Look at the model summary again:
Call:
lm(formula = RT_log ~ 1 + SITE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.10891 -0.30613 -0.08053 0.23524 2.65471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.191055 0.009111 1008.810 <2e-16 ***
SITEb 0.120700 0.012404 9.731 <2e-16 ***
SITEc 0.166161 0.012717 13.066 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4466 on 7749 degrees of freedom
Multiple R-squared: 0.02295, Adjusted R-squared: 0.0227
F-statistic: 91 on 2 and 7749 DF, p-value: < 2.2e-16
Here are some questions:
- what is the difference between the reaction times that the model predicts for SITE: b and for SITE: c?
- is that difference significant – how do we find that out?
- what might we need to pay attention to when we look at that?
1.3.3 Excursus 1: the difference
What is the difference between the reaction times that the model predicts for SITE: b and for SITE: c?
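A sketch of that difference, stored as sdif (a name the text refers back to below):

(sdif <- coef(m_01)["SITEc"] - # the difference between the predictions
   coef(m_01)["SITEb"]) # for sites c & b: 0.04546087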
1.3.4 Excursus 2: is it significant?
1.3.4.1 Conflation via model comparison
In cases where one has multiple levels of a categorical predictor, sometimes the question arises as to whether (all) levels of that predictor are in fact significantly different from each other. In this particular case, the effects plot for SITE very clearly suggests that they are (because the confidence intervals of all predicted means do not overlap), but let’s discuss briefly how this can be tested with a method called model comparison: If you have two models that were fit to the same data set and where one model is a sub-model of the other model – typically such that they differ only by one predictor/term – then you can use the function anova to compare the two models to see whether the more complex model can justify its higher degree of complexity by being significantly better than the less complex model. In this case, 2 of the 3 possible comparisons of SITE are already in the summary table:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.1910552 0.009110786 1008.810364 0.000000e+00
SITEb 0.1207003 0.012404198 9.730601 2.992632e-22
SITEc 0.1661612 0.012716983 13.066084 1.313862e-38
From this, we could already compute the predictions for all three sites and we already know that SITE: a is significantly different from SITE: b and SITE: c, but we don’t know whether the 0.04546087 difference between SITE: b and SITE: c is significant as well. To find that out, we can compare m_01 to a model in which SITE: b and SITE: c are not distinguished, i.e. a model that is missing exactly the difference we’re interested in. For that, we first create a version of SITE that conflates levels as desired:
d$SITE_confl <- d$SITE # create a copy of SITE
levels(d$SITE_confl) <- # overwrite the copy's levels such that
c("a", # "a" remains "a"
"b/c", "b/c") # "b" & "c" become "b/c"
table(SITE=d$SITE, SITE_confl=d$SITE_confl) # check you did it right
SITE_confl
SITE a b/c
a 2403 0
b 0 2815
c 0 2534
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
1 1 9.546894 heritage b 5 4.278754 spanish 748 19 correct
2 2 8.672425 heritage b 5 4.892095 spanish 408 78 correct
3 3 8.915879 heritage b 6 5.367356 spanish 483 233 correct
4 4 10.306062 heritage b 4 5.123852 spanish 1266 133 correct
5 5 9.312883 heritage b 6 4.146128 spanish 636 14 correct
6 6 8.625709 heritage b 6 4.826075 spanish 395 67 correct
PRED SITE_confl
1 9.311756 b/c
2 9.311756 b/c
3 9.311756 b/c
4 9.311756 b/c
5 9.311756 b/c
6 9.311756 b/c
We already have the model with SITE as the only predictor so we only need to fit a new model with SITE_confl as the only predictor:
summary(m_01_siteconfl <- lm( # make/summarize the linear model m_01_siteconfl:
RT_log ~ 1 + SITE_confl, # RT_log ~ an overall intercept (1) & SITE_confl
data=d, na.action=na.exclude)) # the other arguments
Call:
lm(formula = RT_log ~ 1 + SITE_confl, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.10891 -0.30841 -0.08063 0.23555 2.67863
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.191055 0.009118 1007.98 <2e-16 ***
SITE_conflb/c 0.142237 0.010977 12.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.447 on 7750 degrees of freedom
Multiple R-squared: 0.02121, Adjusted R-squared: 0.02108
F-statistic: 167.9 on 1 and 7750 DF, p-value: < 2.2e-16
… and then we do a comparison with anova:
anova(m_01, # compare the model w/ the 3-level predictor SITE
m_01_siteconfl, # & the model w/ the 2-level predictor SITE_confl
test="F") %>% data.frame # using an F-test
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 7749 1545.650 NA NA NA NA
2 7750 1548.406 -1 -2.756055 13.81728 0.0002028949
The difference between the two models is highly significant: The finer resolution of SITE compared to SITE_confl makes the model significantly better (p<0.0003). Thus, we now know not only the size of the 3rd comparison (SITE: b vs. c, which was computable from summary(m_01)), but also its p-value.
1.3.4.2 Post-hoc comparisons
A (for now) final way of getting such results is available from emmeans::emmeans:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.1910552 0.009110786 1008.810364 0.000000e+00
SITEb 0.1207003 0.012404198 9.730601 2.992632e-22
SITEc 0.1661612 0.012716983 13.066084 1.313862e-38
# the following is pretty much the same as sit_d
data.frame(qwe <- emmeans( # compute estimated marginal means
object=m_01, # for the model m_01
specs= ~ SITE)) # namely all contrasts of this predictor
SITE emmean SE df lower.CL upper.CL
1 a 9.191055 0.009110786 7749 9.173196 9.208915
2 b 9.311756 0.008417702 7749 9.295255 9.328256
3 c 9.357216 0.008872161 7749 9.339825 9.374608
From this object qwe with all its means, we can compute all pairwise comparisons:
pairs(qwe, # pairwise comparisons of means
adjust="none") %>% # careful w/ this!
data.frame # show as a data frame
contrast estimate SE df t.ratio p.value
1 a - b -0.12070030 0.01240420 7749 -9.730601 2.992632e-22
2 a - c -0.16616117 0.01271698 7749 -13.066084 1.313862e-38
3 b - c -0.04546087 0.01223000 7749 -3.717160 2.028949e-04
In our case,
- lines 1 and 2 of the pairs(emmeans, ...) output correspond to lines 2 and 3 of the model’s summary table;
- line 3 of the pairs(emmeans, ...) output corresponds to the result we obtained with either the model conflation or the sdif approach above.
Now, why does it say adjust="none") # careful w/ this! above? Because larger numbers of levels can lead to many, many pairwise comparisons:
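The definitions are hidden; to make the following runnable, presumably something like this (how_many_in_sum being the number of comparisons the summary table already shows):

(numbers_of_levels_of_cat_predictors <- 3:10) # some numbers of levels
(how_many_in_sum <- # the comparisons already in the summary table:
   numbers_of_levels_of_cat_predictors-1)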
[1] 3 4 5 6 7 8 9 10
[1] 2 3 4 5 6 7 8 9
(possible_post_tests <- "*"(
numbers_of_levels_of_cat_predictors,
numbers_of_levels_of_cat_predictors-1) / 2)
[1] 3 6 10 15 21 28 36 45
par(mfrow=c(1,2))
plot(pch=16, type="b",
x=numbers_of_levels_of_cat_predictors,
y=possible_post_tests, ylim=c(0, 50)); grid()
plot(ylim=c(0,1), pch=16, type="b",
x=numbers_of_levels_of_cat_predictors,
y=how_many_in_sum/possible_post_tests); grid()
par(mfrow=c(1,1))
1.3.4.3 Multiple post-hoc tests
But what does that mean for our p-values? They get worse and worse …
If we do 1 significance test with a significance level of 0.05, what is the chance we make the right decision when we adopt H1?
If we do 2 significance tests with a significance level of 0.05, what is the chance we always make the right decision when we adopt H1?
If we do 3 significance tests with a significance level of 0.05, what is the chance we always make the right decision when we adopt H1?
If we do 13 significance tests with a significance level of 0.05, what is the chance we always make the right decision when we adopt H1?
Thus, if we test the living hell out of a predictor by testing all possible comparisons, we need to protect ourselves against this – how?
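Sketches of the computations implied here: the probabilities of making only correct decisions across 1, 2, 3, and 13 tests, and then Bonferroni and Šidák corrections (the outputs below suggest they were computed for 3 and 8 tests):

0.95^c(1, 2, 3, 13) # the chances of making only correct decisions
0.05/3; 1-0.95^(1/3) # Bonferroni & Šidák corrections for 3 tests
0.05/8; 1-0.95^(1/8) # Bonferroni & Šidák corrections for 8 tests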
[1] 0.01666667
[1] 0.01695243
[1] 0.00625
[1] 0.006391151
But many would say these are too conservative. My advice: use p.adjust:
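A sketch of how p.adjust can be applied to the pairwise comparisons from above (Holm’s method as an example):

p.adjust( # adjust the p-values
   data.frame(pairs(qwe, adjust="none"))$p.value, # of the pairwise comparisons
   method="holm") # w/ Holm's correction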
1.3.5 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the site where the data were collected (a vs. b vs. c), a linear model was fit with RT_log as the response variable and SITE as a predictor. The model was highly significant (F=91, df1=2, df2=7749, p<0.0001) but explains very little of the response variable (mult. R2=0.023, adj. R2=0.023). The coefficients indicate that reaction times are shortest at site a and longer at sites b and c, as shown in the summary table and the figure shown here. [Show a plot] The model is potentially somewhat problematic because its residuals are not normally distributed and come with a very long right tail (esp. for SITE: c).
| | Estimate | 95%-CI | se | t | p2-tailed |
|---|---|---|---|---|---|
| Intercept (SITE=a) | 9.191 | [9.173, 9.209] | 0.009 | 1008.81 | <0.001 |
| SITEa→b | 0.121 | [0.096, 0.145] | 0.012 | 9.73 | <0.001 |
| SITEa→c | 0.166 | [0.141, 0.191] | 0.013 | 13.07 | <0.001 |
A post-hoc comparison of the above model with one in which sites b and c were conflated indicates that sites b and c are also significantly different from each other (difference between means: 0.04546, F=13.817/t=3.717, p<0.0003).
Some housekeeping:
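Presumably dropping the helper columns again:

d$PRED <- d$SITE_confl <- NULL # remove the helper columns
str(d) # check the result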
'data.frame': 7752 obs. of 10 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ RT_log : num 9.55 8.67 8.92 10.31 9.31 ...
$ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
$ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
$ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
$ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
$ LANGUAGE: Factor w/ 2 levels "spanish","english": 1 1 1 1 1 1 1 1 1 1 ...
$ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
$ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
1.4 A numeric predictor
Let us now extend this to a scenario where we have a numeric predictor. Imagine we’re asking, does the reaction time to a word (in ms) vary as a function of the frequency of that word (normalized to per million words)?
This time we’re smarter than before and begin with exploration:
# hist(d$RT_log); hist(d$RT_log, breaks="FD")
par(mfrow=c(1,2))
hist(d$FREQPMW); hist(d$FREQPMW, breaks="FD")
par(mfrow=c(1,1))
Thoughts? That looks terrible again …
Let’s also look at the pairwise relationship:
#| fig-show: hold
# the predictor(s) w/ the response
plot(pch=16, col="#00000020",
x=d$FREQPMW,
y=d$RT_log)
abline(v=quantiles(d$FREQPMW), lty=3, col="#00000040")
abline(h=quantiles(d$RT_log) , lty=3, col="#00000040")
Thoughts? Obviously, we can’t work with this – let’s use ZIPFFREQ instead:
par(mfrow=c(1,3))
# the predictor(s)/response on its/their own
hist(d$ZIPFFREQ); hist(d$ZIPFFREQ, breaks="FD")
# the predictor(s) w/ the response
plot(pch=16, col="#00000020",
x=d$ZIPFFREQ,
y=d$RT_log)
abline(v=quantiles(d$ZIPFFREQ), lty=3, col="#00000040")
abline(h=quantiles(d$RT_log) , lty=3, col="#00000040")
par(mfrow=c(1,1))
1.4.1 Modeling & numerical interpretation
How would we have dealt with this without a linear model? We would have computed a Pearson product-moment correlation coefficient r (or Spearman’s ρ, either with cor.test). Thus, we might have done something like this (for Pearson’s r) …
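The call that produces the following output (the output itself names the two variables):

cor.test(d$RT_log, d$ZIPFFREQ) # Pearson's r & its significance test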
Pearson's product-moment correlation
data: d$RT_log and d$ZIPFFREQ
t = -12.939, df = 7750, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1671408 -0.1235589
sample estimates:
cor
-0.1454204
But we want to fit a linear model so here goes:
summary(m_01 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
ZIPFFREQ, # & the predictor ZIPFFREQ (which involves logging)
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + ZIPFFREQ, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.21421 -0.30499 -0.07338 0.22546 2.75795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.84212 0.04303 228.72 <2e-16 ***
ZIPFFREQ -0.11995 0.00927 -12.94 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.447 on 7750 degrees of freedom
Multiple R-squared: 0.02115, Adjusted R-squared: 0.02102
F-statistic: 167.4 on 1 and 7750 DF, p-value: < 2.2e-16
Estimate 2.5 % 97.5 %
(Intercept) 9.8421186 9.7577652 9.9264721
ZIPFFREQ -0.1199524 -0.1381246 -0.1017802
Is this an acceptable model (in terms of diagnostics)? It’s not that bad, I suppose …
par(mfrow=c(2, 2)) # make the plotting window have 2 rows & 2 columns
hist(d$RT_log, breaks="FD") # a histogram of the response
hist(residuals(m_01), main="Residuals") # a histogram of the residuals
plot(m_01, which=1:2) # plot model-diagnostic plots 1 & 2
par(mfrow=c(1, 1)) # reset to default
Now, first again, is the predictor ZIPFFREQ significant? This is easy here because the model has just this one predictor so the p-value of the whole model is the p-value of its only predictor. Also, the predictor has 1 df (because it is numeric), which means that the p-value in the coefficients table also tells us what the p-value for ‘all of ZIPFFREQ’ is.
Second, note the equivalence of the classical Pearson’s r and the linear model we fit:
- the t-statistic from Pearson’s r (-12.9394984) squared is m_01’s F-value (167.4306191);
- the t-statistic from Pearson’s r (-12.9394984) is also the t-statistic of ZIPFFREQ (-12.9394984);
- the df of Pearson’s r (7750) is also the df2 of m_01 (7750);
- the value of Pearson’s r (-0.1454204) squared is m_01’s Multiple R-squared (0.0211471).
In this case, there aren’t means to estimate for levels of the predictor simply because the predictor has no levels: it is numeric. But we can informally think about this as a continuous extension of how we dealt with predictors so far, especially the binary case. There, one level (the first one, spanish) was considered 0/FALSE and mapped onto the intercept; the other (english) was considered 1/TRUE. But here, it’s a tiny bit different – look at the model.matrix:
RT_log ZIPFFREQ
216 9.147205 3
35 9.228819 4
(Intercept) ZIPFFREQ
216 1 3
35 1 4
In model.matrix(m_01), the 2nd column is now just the value of the numeric predictor and otherwise the logic from above is the same. The coefficients are used like this: For each case x, you
- multiply the coefficients with row x from the model matrix;
- sum up the result,
and that is the prediction for that case:
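A sketch of this arithmetic for the two cases shown below (rows with ZIPFFREQ values 3 and 4):

sum(coef(m_01) * c(1, 3)) # prediction for ZIPFFREQ=3
sum(coef(m_01) * c(1, 4)) # prediction for ZIPFFREQ=4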
[1] 9.482261
[1] 9.362309
In other words,
- we use the intercept for when the numeric predictor is 0;
- we use the coefficient to represent how much the prediction changes when the value of the numeric predictor changes from 0/FALSE to 1/TRUE, which makes it a slope, and this slope is significantly different from 0.
Thus, this is what this model predicts:
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
216 234 9.147205 heritage c 5 3 spanish 567 1 correct
35 36 9.228819 heritage b 3 4 spanish 600 10 correct
PRED
216 9.482261
35 9.362309
How good is this model? For that we need the deviance again:
How much is this an improvement in % of the original deviance?
What is that again? R-squared.
Let’s again compute effects predictions for this predictor, but we do so in a way that can be didactically useful: we do not let the effects package guess a few values, we define a bunch of ‘relevant’ values ourselves:
(relevants_z <- c(
0:1, # these help us understand the slope
3:6, # these cover the range of the actual values
mean(d$ZIPFFREQ)) %>% sort) # the central tendency
[1] 0.000000 1.000000 3.000000 4.000000 4.609476 5.000000 6.000000
Then, we compute the effects predictions of the model for these relevant values:
(zpf_d <- data.frame( # make zpf_d a data frame of
zpf <- effect( # the effect
"ZIPFFREQ", # of ZIPFFREQ
m_01, # in m_01
# but predictions for our relevant values:
xlevels=list(ZIPFFREQ=relevants_z))))
ZIPFFREQ fit se lower upper
1 0.000000 9.842119 0.043031556 9.757765 9.926472
2 1.000000 9.722166 0.033843723 9.655823 9.788509
3 3.000000 9.482261 0.015760356 9.451367 9.513156
4 4.000000 9.362309 0.007595874 9.347419 9.377199
5 4.609476 9.289201 0.005076888 9.279249 9.299153
6 5.000000 9.242356 0.006235465 9.230133 9.254580
7 6.000000 9.122404 0.013854241 9.095246 9.149562
1.4.2 Visual interpretation
Let’s visualize the nature of the effect of ZIPFFREQ based on the predictions from effects; note, this one is a bit annoying now because we used the didactically useful vector relevants_z, which now results in an x-axis that covers the interpretationally useless value range [0, 3):
plot(zpf, # plot the effect of ZIPFFREQ in m_01
xlab="Zipf frequency", # w/ this x-axis label
ylab="RT in ms (logged)", # w/ this y-axis label
ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
grid=TRUE) # & w/ a grid
So here is a version with everything: observed values, predicted values, and the (barely visible) confidence band of the predictions/regression line:
1.4.3 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words), a linear model was fit with RT_log as the response variable and ZIPFFREQ as a predictor (because exploration of FREQPMW indicated a very skewed distribution, which is why the log-transformed version of this predictor was used in its place). The model was highly significant (F=167.4, df1=1, df2=7750, p<0.0001) but explains very little of the response variable (mult. R2=0.021, adj. R2=0.021). The coefficients indicate that ZIPFFREQ is negatively correlated with RT_log, as shown in the summary table and the figure shown here. [Show a plot]
| | Estimate | 95%-CI | se | t | p2-tailed |
|---|---|---|---|---|---|
| Intercept (ZIPFFREQ=0) | 9.842 | [9.758, 9.926] | 0.043 | 228.72 | <0.001 |
| ZIPFFREQ0→1 | -0.12 | [-0.138, -0.102] | 0.009 | -12.94 | <0.001 |
Some housekeeping:
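Presumably just dropping the predictions again:

d$PRED <- NULL # remove the helper column
str(d) # check the result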
'data.frame': 7752 obs. of 10 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ RT_log : num 9.55 8.67 8.92 10.31 9.31 ...
$ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
$ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
$ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
$ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
$ LANGUAGE: Factor w/ 2 levels "spanish","english": 1 1 1 1 1 1 1 1 1 1 ...
$ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
$ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
1.4.4 Just to really drive this home …
This is what one would have done with the unprepared response and predictor:
par(mfrow=c(1,2))
plot(
main="RT in ms ~ freq pmw",
pch=16, col="#00000030",
xlab="Frequency pmw", xlim=c( 0, 1200) , x=d$FREQPMW,
ylab="RT in ms" , ylim=c(417, 1137.45), y=d$RT)
abline(lm(RT ~ FREQPMW, data=d), col="red", lwd=2)
abline(h=quantiles(d$RT) , lty=3, col="#00FF0080")
abline(v=quantiles(d$FREQPMW), lty=3, col="#00FF0080")
hist(residuals(lm(RT ~ FREQPMW, data=d)), breaks="FD", main="", xlab="Residuals")
par(mfrow=c(1,1))
… and this is the better thing we did:
par(mfrow=c(1,2))
plot(
main="RT in ms (logged) ~ of Zipf scale",
pch=16, col="#00000030",
xlab="Zipf scaled" , xlim=c(3, 6), x=d$ZIPFFREQ,
ylab="RT in ms (log2ged)", ylim=c(8, 12), y=d$RT_log)
abline(lm(RT_log ~ ZIPFFREQ, data=d), col="red", lwd=2)
abline(h=quantiles(d$RT_log) , lty=3, col="#00FF0080")
abline(v=quantiles(d$ZIPFFREQ), lty=3, col="#00FF0080")
hist(residuals(lm(RT_log ~ ZIPFFREQ, data=d)), breaks="FD", main="", xlab="Residuals")
par(mfrow=c(1,1))
1.5 Two categorical predictors
Let us now extend this to a scenario where we have two categorical predictors. Imagine we’re asking, does the logged reaction time to a word (in ms) vary as a function of
- the language that the stimulus word was presented in: LANGUAGE: spanish vs. english;
- the site where the data were collected: SITE: a vs. b vs. c.
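A sketch of the exploration generating the two outputs below, assuming table() and by() were used (the grouped six-number summaries have the look of by() output):

table(d$LANGUAGE, d$SITE) # the joint frequencies of the predictors
by(d$RT_log, # summarize RT_log
   list(d$LANGUAGE, d$SITE), # for each combination of the predictors
   summary) # w/ the usual summary statistics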
a b c
spanish 1094 1401 1280
english 1309 1414 1254
: spanish
: a
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.243 9.056 9.301 9.305 9.564 10.051
------------------------------------------------------------
: english
: a
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.082 8.873 9.061 9.095 9.331 10.050
------------------------------------------------------------
: spanish
: b
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.539 9.140 9.331 9.427 9.697 10.359
------------------------------------------------------------
: english
: b
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.520 8.928 9.150 9.197 9.385 10.353
------------------------------------------------------------
: spanish
: c
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.539 9.103 9.401 9.527 9.739 12.012
------------------------------------------------------------
: english
: c
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.520 8.899 9.095 9.184 9.371 11.341
Let’s also visualize:
# the predictor(s) w/ the response
boxplot(main="RT in ms (logged) ~ LANGUAGE + SITE", varwidth=TRUE,
d$RT_log ~ d$LANGUAGE+d$SITE, notch=TRUE,
xlab="Predictor combinations",
ylab="RT in ms (logged)" , ylim=c(8, 12)); grid()
1.5.1 Modeling & numerical interpretation
Let’s fit a model:
summary(m_01 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
LANGUAGE + SITE, # & the predictors
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + LANGUAGE + SITE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.09009 -0.29309 -0.06378 0.22331 2.52552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.333261 0.010196 915.406 <2e-16 ***
LANGUAGEenglish -0.261055 0.009716 -26.869 <2e-16 ***
SITEb 0.109625 0.011872 9.234 <2e-16 ***
SITEc 0.153143 0.012173 12.580 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4272 on 7748 degrees of freedom
Multiple R-squared: 0.1062, Adjusted R-squared: 0.1059
F-statistic: 306.9 on 3 and 7748 DF, p-value: < 2.2e-16
Estimate 2.5 % 97.5 %
(Intercept) 9.3332611 9.31327465 9.3532475
LANGUAGEenglish -0.2610548 -0.28010081 -0.2420088
SITEb 0.1096246 0.08635282 0.1328964
SITEc 0.1531434 0.12928028 0.1770066
As before: are both predictors significant? Here, this is more difficult: LANGUAGE has 1 df so its p-value is clear from the coefficients table, but SITE has >1 df so its p-value is not clear, and we cannot glean it from the model’s overall p-value, because this one includes whatever LANGUAGE contributes. What can we do? The long answer goes back to model comparison: Above we said this:
a method called **model comparison**: If you have two models that were fit to
the same data set and where one model is a sub-model of the other model --
typically such that they differ only by one predictor/term -- then you can use
the function `anova` to compare the two models to see whether the more complex
model can justify its higher degree of complexity by being significantly better
than the less complex model.
So we use anova again:
summary(m_02l <- lm( # make/summarize the linear model m_02l:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
LANGUAGE , # & only the predictor LANGUAGE
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + LANGUAGE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.18270 -0.28696 -0.06671 0.21618 2.58605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.425872 0.007027 1341.44 <2e-16 ***
LANGUAGEenglish -0.266402 0.009810 -27.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4317 on 7750 degrees of freedom
Multiple R-squared: 0.08688, Adjusted R-squared: 0.08677
F-statistic: 737.4 on 1 and 7750 DF, p-value: < 2.2e-16
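The comparison producing the output below, presumably as before:

anova(m_01, # compare the model w/ both predictors
   m_02l, # to the model w/ only LANGUAGE
   test="F") %>% data.frame # using an F-test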
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 7748 1413.909 NA NA NA NA
2 7750 1444.505 -2 -30.59622 83.83123 9.569434e-37
# better:
summary(m_02l <- update( # make m_02l an update
m_01, .~. # of m_01, all of it
- SITE)) # but then not SITE
Call:
lm(formula = RT_log ~ LANGUAGE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.18270 -0.28696 -0.06671 0.21618 2.58605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.425872 0.007027 1341.44 <2e-16 ***
LANGUAGEenglish -0.266402 0.009810 -27.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4317 on 7750 degrees of freedom
Multiple R-squared: 0.08688, Adjusted R-squared: 0.08677
F-statistic: 737.4 on 1 and 7750 DF, p-value: < 2.2e-16
Clearly, SITE makes a hugely significant contribution to m_01. Let’s do this for LANGUAGE as well, just for practice and completeness:
summary(m_02s <- update( # make m_02s an update
m_01, .~. # of m_01, all of it
- LANGUAGE)) # but then not LANGUAGE
Call:
lm(formula = RT_log ~ SITE, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.10891 -0.30613 -0.08053 0.23524 2.65471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.191055 0.009111 1008.810 <2e-16 ***
SITEb 0.120700 0.012404 9.731 <2e-16 ***
SITEc 0.166161 0.012717 13.066 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4466 on 7749 degrees of freedom
Multiple R-squared: 0.02295, Adjusted R-squared: 0.0227
F-statistic: 91 on 2 and 7749 DF, p-value: < 2.2e-16
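… and the corresponding comparison for LANGUAGE:

anova(m_01, # compare the model w/ both predictors
   m_02s, # to the model w/ only SITE
   test="F") %>% data.frame # using an F-test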
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 7748 1413.909 NA NA NA NA
2 7749 1545.650 -1 -131.7407 721.9181 4.054422e-152
Thankfully, there is a shorter way to check the significance of droppable predictors: drop1 returns what we had to create with anova stepwise:
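The call producing the table below:

drop1(m_01, # drop each droppable term from m_01
   test="F") # & test each deletion w/ an F-test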
Single term deletions
Model:
RT_log ~ 1 + LANGUAGE + SITE
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 1413.9 -13183
LANGUAGE 1 131.741 1545.7 -12494 721.918 < 2.2e-16 ***
SITE 2 30.596 1444.5 -13021 83.831 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So what do the coefficients mean? For that, we return to the data and the model matrix:
RT_log LANGUAGE SITE
10 8.299208 spanish a
9 9.729621 spanish b
7 9.579316 spanish c
101 9.066089 english a
103 9.100662 english b
93 8.836050 english c
(Intercept) LANGUAGEenglish SITEb SITEc
10 1 0 0 0
9 1 0 1 0
7 1 0 0 1
101 1 1 0 0
103 1 1 1 0
93 1 1 0 1
The model matrix shows that we basically now have a combination of what we saw for each categorical predictor above and we can see that
- the intercept is the prediction for when
  - LANGUAGE is spanish and
  - SITE is a (see case 10);
- coefficient 2 is how we get from the intercept to the prediction for when
  - LANGUAGE changes from spanish to english, but
  - SITE isn’t mentioned, it remains the intercept value a (see case 101);
- coefficient 3 is how we get from the intercept to the prediction for when
  - LANGUAGE isn’t mentioned, it remains the intercept value spanish, but
  - SITE changes from a to b (see case 9);
- coefficient 4 is how we get from the intercept to the prediction for when
  - LANGUAGE isn’t mentioned, it remains the intercept value spanish, but
  - SITE changes from a to c (see case 7).
But with these 4 coefficients, we can also make predictions for when both LANGUAGE changes to english AND SITE changes to b or c; see cases 103 and 93 respectively, where we essentially just add up the separate effects of LANGUAGE and SITE. The coefficients are therefore used like above:
"*"( # prediction for case 103 (LANGUAGE: english, SITE: b):
coef(m_01),
model.matrix(m_01)[103,]) %>% sum
[1] 9.181831
"*"( # prediction for case 93 (LANGUAGE: english, SITE: c):
coef(m_01),
model.matrix(m_01)[93,]) %>% sum
[1] 9.22535
Thus, this is what this model predicts and we again add it to the data frame d:
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
10 10 8.299208 heritage a 6 4.301030 spanish 315 20 correct
9 9 9.729621 heritage b 5 3.301030 spanish 849 2 correct
7 7 9.579316 heritage c 8 4.531479 spanish 765 34 correct
101 110 9.066089 heritage a 6 3.477121 english 536 3 correct
103 112 9.100662 heritage b 4 4.579784 english 549 38 correct
93 101 8.836050 heritage c 6 4.869232 english 457 74 correct
PRED
10 9.333261
9 9.442886
7 9.486405
101 9.072206
103 9.181831
93 9.225350
How good is this model? Same as above:
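A sketch, again using deviance() and the null model for RT_log from above:

deviance(m_01) # the deviance of m_01
(deviance(m_00) - deviance(m_01)) / # the reduction of the deviance
   deviance(m_00) # as a proportion: multiple R-squared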
[1] 1413.909
[1] 0.1062247
Let’s compute emmeans and effects results again: We begin with what emmeans does if you force it to produce all six means at the same time:
data.frame(qwe <- emmeans( # compute estimated marginal means
object=m_01, # for the model m_01
specs= ~ LANGUAGE:SITE)) # namely all combinations of the interaction
LANGUAGE SITE emmean SE df lower.CL upper.CL
1 spanish a 9.333261 0.010195760 7748 9.313275 9.353248
2 english a 9.072206 0.009772786 7748 9.053049 9.091364
3 spanish b 9.442886 0.009415164 7748 9.424429 9.461342
4 english b 9.181831 0.009391984 7748 9.163420 9.200242
5 spanish c 9.486405 0.009753654 7748 9.467285 9.505524
6 english c 9.225350 0.009803181 7748 9.206133 9.244567
But let’s now look at the effects of the predictors individually. First with emmeans again:
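A sketch of these two calls:

data.frame(emmeans(object=m_01, specs= ~ LANGUAGE)) # emmeans for LANGUAGE
data.frame(emmeans(object=m_01, specs= ~ SITE)) # emmeans for SITE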
LANGUAGE emmean SE df lower.CL upper.CL
1 spanish 9.420850 0.006969945 7748 9.407187 9.434513
2 english 9.159796 0.006778074 7748 9.146509 9.173082
SITE emmean SE df lower.CL upper.CL
1 a 9.202734 0.008725265 7748 9.185630 9.219838
2 b 9.312358 0.008051530 7748 9.296575 9.328141
3 c 9.355877 0.008486334 7748 9.339242 9.372513
What are those? Simple: The predictions emmeans gives you for one predictor are the averages of the predictions for that predictor over the levels of the other:
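A sketch of that averaging, plugging in the six cell means from qwe above:

mean(c(9.333261, 9.442886, 9.486405)) # spanish, averaged over the 3 sites
mean(c(9.072206, 9.181831, 9.225350)) # english, averaged over the 3 sites
mean(c(9.333261, 9.072206)) # site a, averaged over the 2 languages
mean(c(9.442886, 9.181831)) # site b, averaged over the 2 languages
mean(c(9.486405, 9.225350)) # site c, averaged over the 2 languages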
[1] 9.42085
[1] 9.159796
[1] 9.202734
[1] 9.312358
[1] 9.355877
But what is more appropriate are the pairwise differences for each predictor:
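A sketch of the calls behind the output below (the unadjusted p-value for b vs. c suggests adjust="none" was used for SITE):

pairs(emmeans(m_01, specs= ~ LANGUAGE)) # the pairwise comparison for LANGUAGE
pairs(emmeans(m_01, specs= ~ SITE), adjust="none") # & the three for SITE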
contrast estimate SE df t.ratio p.value
spanish - english 0.261 0.00972 7748 26.869 <.0001
Results are averaged over the levels of: SITE
contrast estimate SE df t.ratio p.value
a - b -0.1096 0.0119 7748 -9.234 <.0001
a - c -0.1531 0.0122 7748 -12.580 <.0001
b - c -0.0435 0.0117 7748 -3.720 0.0002
Results are averaged over the levels of: LANGUAGE
Ok, now the effects predictions, which will be more interesting …:
LANGUAGE emmean SE df lower.CL upper.CL
spanish 9.421 0.00697 7748 9.407 9.435
english 9.160 0.00678 7748 9.147 9.173
Results are averaged over the levels of: SITE
Confidence level used: 0.95
(lan_d <- data.frame( # make lan_d a data frame of
lan <- effect( # the effect
"LANGUAGE", # of LANGUAGE
m_01))) # in m_01
LANGUAGE fit se lower upper
1 spanish 9.423129 0.006956061 9.409494 9.436765
2 english 9.162075 0.006776940 9.148790 9.175359
SITE emmean SE df lower.CL upper.CL
a 9.203 0.00873 7748 9.186 9.220
b 9.312 0.00805 7748 9.297 9.328
c 9.356 0.00849 7748 9.339 9.373
Results are averaged over the levels of: LANGUAGE
Confidence level used: 0.95
(sit_d <- data.frame( # make sit_d a data frame of
sit <- predictorEffect( # the effect
"SITE", # of SITE
m_01))) # in m_01
SITE fit se lower upper
1 a 9.199332 0.008719875 9.182239 9.216426
2 b 9.308957 0.008052173 9.293173 9.324741
3 c 9.352476 0.008488022 9.335837 9.369115
?! How do these effects results relate to the values from emmeans? So, there are three different ways in which R reports results and you have to know what you want:
- the coefficients in the summary output are based on everything categorical being its first level (and everything numeric being 0) and then you get the differences between means (and slopes);
- emmeans default predictions are based on averages for the predictions for PREDICTOR over the levels of another predictor (and everything numeric being the mean; see below);
- effects default predictions are based on averages for the predictions for PREDICTOR over the levels of another predictor but weighted by the frequencies of these other levels (and everything numeric being the mean; see below)!
Thus, this should work:
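A sketch: the effects values for LANGUAGE are the cell means weighted by the SITE frequencies:

weighted.mean( # the weighted mean of
   c(9.333261, 9.442886, 9.486405), # the spanish predictions,
   w=table(d$SITE)) # weighted by the site frequencies
weighted.mean( # the weighted mean of
   c(9.072206, 9.181831, 9.225350), # the english predictions,
   w=table(d$SITE)) # weighted by the site frequencies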
LANGUAGE fit se lower upper
1 spanish 9.423129 0.006956061 9.409494 9.436765
2 english 9.162075 0.006776940 9.148790 9.175359
[1] 9.423129
[1] 9.162075
And then the same for SITE:
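Now weighted by the LANGUAGE frequencies:

weighted.mean(c(9.333261, 9.072206), w=table(d$LANGUAGE)) # SITE: a
weighted.mean(c(9.442886, 9.181831), w=table(d$LANGUAGE)) # SITE: b
weighted.mean(c(9.486405, 9.225350), w=table(d$LANGUAGE)) # SITE: c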
SITE fit se lower upper
1 a 9.199332 0.008719875 9.182239 9.216426
2 b 9.308957 0.008052173 9.293173 9.324741
3 c 9.352476 0.008488022 9.335837 9.369115
[1] 9.199332
[1] 9.308957
[1] 9.352476
Very cool …
Note: you can get effects kind of results with the emmeans package, too, if you give emmeans an additional argument weights="outer" (because its default is weights="equal"). And you can get emmeans results with the effects package, too, if you give effects an additional argument fixed.predictors=list(given.values="equal") (because its default is fixed.predictors=list(given.values="default")).
1.5.2 Visual interpretation
Let’s visualize each predictor’s effect:
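A sketch along the lines of the earlier effect plots (lan and sit are the predictorEffect objects from above):

plot(lan, xlab="Language", # plot the effect of LANGUAGE in m_01
   ylab="RT in ms (logged)", ylim=c(8, 12), grid=TRUE)
plot(sit, xlab="Site", # plot the effect of SITE in m_01
   ylab="RT in ms (logged)", ylim=c(8, 12), grid=TRUE)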
1.5.3 Modeling with interactions
Let’s look at the predictions of the model again:
LANGUAGE SITE emmean SE df lower.CL upper.CL
spanish a 9.333 0.01020 7748 9.313 9.353
english a 9.072 0.00977 7748 9.053 9.091
spanish b 9.443 0.00942 7748 9.424 9.461
english b 9.182 0.00939 7748 9.163 9.200
spanish c 9.486 0.00975 7748 9.467 9.506
english c 9.225 0.00980 7748 9.206 9.245
Confidence level used: 0.95
Let’s display this a little more conveniently for a moment:
# with the data frame version of qwe
with(data.frame(qwe),
# apply to the predictions in this column
tapply(emmean,
# a grouping by the combinations of these two columns
list(SITE, LANGUAGE),
# and list the 1 value you get there
c))
spanish english
a 9.333261 9.072206
b 9.442886 9.181831
c 9.486405 9.225350
The critical thing to realize here is that
- the effect of LANGUAGE: spanish→english is the same everywhere, i.e. regardless of whether SITE is a, b, or c;
- the effect of SITE: a→b is the same everywhere, i.e. regardless of whether LANGUAGE is spanish or english;
- the effect of SITE: b→c is the same everywhere, i.e. regardless of whether LANGUAGE is spanish or english.
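We can verify this constancy directly from the little table we just created (a sketch; preds is that table re-created as a matrix):
preds <- with(data.frame(qwe), # re-create the table from above
   tapply(emmean, list(SITE, LANGUAGE), c))
preds[,"spanish"] - preds[,"english"] # LANGUAGE effect: 0.261055 at every SITE
diff(preds[,"spanish"]) # SITE effects a→b & b→c for spanish ...
diff(preds[,"english"]) # ... are identical to those for english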
For pretty obvious reasons, this is called additive behavior. But what if additivity is not a realistic or accurate assumption? Maybe the effect of LANGUAGE (or SITE) is not the same regardless of the value of SITE (or LANGUAGE). If that’s the case, we call this an interaction: something doesn’t do the same everywhere/always, one of the most important notions you can learn.
How do we check for this in R? Like this:
summary(m_02 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
LANGUAGE + SITE + # & the predictors
LANGUAGE:SITE, # & their interaction <----------------------------------------------
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + LANGUAGE + SITE + LANGUAGE:SITE, data = d,
na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.06224 -0.28150 -0.06249 0.22645 2.48487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.30541 0.01289 722.057 < 2e-16 ***
LANGUAGEenglish -0.20993 0.01746 -12.023 < 2e-16 ***
SITEb 0.12208 0.01720 7.098 1.37e-12 ***
SITEc 0.22164 0.01755 12.629 < 2e-16 ***
LANGUAGEenglish:SITEb -0.02048 0.02373 -0.863 0.388
LANGUAGEenglish:SITEc -0.13327 0.02433 -5.479 4.42e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4263 on 7746 degrees of freedom
Multiple R-squared: 0.1103, Adjusted R-squared: 0.1098
F-statistic: 192.1 on 5 and 7746 DF, p-value: < 2.2e-16
Estimate 2.5 % 97.5 %
(Intercept) 9.30541146 9.28014876 9.33067417
LANGUAGEenglish -0.20992980 -0.24415822 -0.17570138
SITEb 0.12208024 0.08836735 0.15579314
SITEc 0.22164497 0.18724048 0.25604947
LANGUAGEenglish:SITEb -0.02047856 -0.06699429 0.02603716
LANGUAGEenglish:SITEc -0.13327174 -0.18095638 -0.08558711
Is the interaction significant? We can test this again, either the (slightly) long(er) or the shorter way:
anova( # compare this
m_01, # this model is the
m_02, # to this one long way
test="F") # with an F-test
Analysis of Variance Table
Model 1: RT_log ~ 1 + LANGUAGE + SITE
Model 2: RT_log ~ 1 + LANGUAGE + SITE + LANGUAGE:SITE
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7748 1413.9
2 7746 1407.4 2 6.4931 17.868 1.811e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# in the future, we will make the output of anova data frames to reduce the output
drop1(m_02, test="F") %>% data.frame # the short way
Df Sum.of.Sq RSS AIC F.value Pr..F.
<none> NA NA 1407.416 -13214.43 NA NA
LANGUAGE:SITE 2 6.493064 1413.909 -13182.75 17.86795 1.810901e-08
In other words, the predictions from m_01 are significantly worse than those from m_02, which is the model we should report. But how do we get the new predictions now? As always: from the coefficients and the model matrix:
RT_log LANGUAGE SITE
10 8.299208 spanish a
9 9.729621 spanish b
7 9.579316 spanish c
101 9.066089 english a
103 9.100662 english b
93 8.836050 english c
(Intercept) LANGUAGEenglish SITEb SITEc LANGUAGEenglish:SITEb
10 1 0 0 0 0
9 1 0 1 0 0
7 1 0 0 1 0
101 1 1 0 0 0
103 1 1 1 0 1
93 1 1 0 1 0
LANGUAGEenglish:SITEc
10 0
9 0
7 0
101 0
103 0
93 1
The model matrix shows that we basically now have a combination of what we saw for each categorical predictor above and we can see that
- the intercept is the prediction for when
  - LANGUAGE is spanish and
  - SITE is a (see case 10);
- coefficient 2 is how we get from the intercept to the prediction for when
  - LANGUAGE changes from spanish to english, but
  - SITE isn’t mentioned, it remains the intercept value a (see case 101);
- coefficient 3 is how we get from the intercept to the prediction for when
  - LANGUAGE isn’t mentioned, it remains the intercept value spanish, but
  - SITE changes from a to b (see case 9);
- coefficient 4 is how we get from the intercept to the prediction for when
  - LANGUAGE isn’t mentioned, it remains the intercept value spanish, but
  - SITE changes from a to c (see case 7).
But now we do not just have additive effects, which means the predictions for LANGUAGE: english and SITE: b as well as for LANGUAGE: english and SITE: c are different in how they require an additional addition, the interaction term:
- coefficient 5 is what you add to the intercept, the main effect of LANGUAGE: spanish→english, and the main effect of SITE: a→b when these predictor values coincide (see case 103);
- coefficient 6 is what you add to the intercept, the main effect of LANGUAGE: spanish→english, and the main effect of SITE: a→c when these predictor values coincide (see case 93).
The coefficients are therefore used like above:
"*"( # prediction for case 101 (LANGUAGE: english, SITE: a):
coef(m_02),
model.matrix(m_02)[101,]) %>% sum
[1] 9.095482
# same as 9.30541 + -0.20993
"*"( # prediction for case 103 (LANGUAGE: english, SITE: b):
coef(m_02),
model.matrix(m_02)[103,]) %>% sum
[1] 9.197083
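# same as 9.30541 + -0.20993 + 0.12208 + -0.02048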
So let’s generate all predictions again and also add them to the data frame d:
In a case like this – where the interaction includes all predictors – all the results from emmeans …
LANGUAGE SITE emmean SE df lower.CL upper.CL
1 spanish a 9.305411 0.01288736 7746 9.280149 9.330674
2 english a 9.095482 0.01178156 7746 9.072387 9.118577
3 spanish b 9.427492 0.01138816 7746 9.405168 9.449816
4 english b 9.197083 0.01133569 7746 9.174862 9.219304
5 spanish c 9.527056 0.01191427 7746 9.503701 9.550412
6 english c 9.183855 0.01203715 7746 9.160259 9.207451
SITE LANGUAGE emmean SE df lower.CL upper.CL
1 a spanish 9.305411 0.01288736 7746 9.280149 9.330674
2 b spanish 9.427492 0.01138816 7746 9.405168 9.449816
3 c spanish 9.527056 0.01191427 7746 9.503701 9.550412
4 a english 9.095482 0.01178156 7746 9.072387 9.118577
5 b english 9.197083 0.01133569 7746 9.174862 9.219304
6 c english 9.183855 0.01203715 7746 9.160259 9.207451
LANGUAGE SITE emmean SE df lower.CL upper.CL
1 spanish a 9.305411 0.01288736 7746 9.280149 9.330674
2 english a 9.095482 0.01178156 7746 9.072387 9.118577
3 spanish b 9.427492 0.01138816 7746 9.405168 9.449816
4 english b 9.197083 0.01133569 7746 9.174862 9.219304
5 spanish c 9.527056 0.01191427 7746 9.503701 9.550412
6 english c 9.183855 0.01203715 7746 9.160259 9.207451
and effects
are the same:
(lasi_d <- data.frame( # make lasi_d a data frame of
lasi <- effect( # the effect
"LANGUAGE:SITE", # of LANGUAGE:SITE
m_02))) # in m_02
LANGUAGE SITE fit se lower upper
1 spanish a 9.305411 0.01288736 9.280149 9.330674
2 english a 9.095482 0.01178156 9.072387 9.118577
3 spanish b 9.427492 0.01138816 9.405168 9.449816
4 english b 9.197083 0.01133569 9.174862 9.219304
5 spanish c 9.527056 0.01191427 9.503701 9.550412
6 english c 9.183855 0.01203715 9.160259 9.207451
But note of course that emmeans can give you nice post-hoc tests:
contrast SITE estimate SE df t.ratio p.value
1 spanish - english a 0.2099298 0.01746107 7746 12.02273 5.302321e-33
2 spanish - english b 0.2304084 0.01606823 7746 14.33938 4.808368e-46
3 spanish - english c 0.3432015 0.01693644 7746 20.26409 5.255320e-89
contrast LANGUAGE estimate SE df t.ratio p.value
1 a - b spanish -0.12208024 0.01719808 7746 -7.0984794 1.373008e-12
2 a - c spanish -0.22164497 0.01755089 7746 -12.6287005 3.330301e-36
3 b - c spanish -0.09956473 0.01648151 7746 -6.0409975 1.602418e-09
4 a - b english -0.10160168 0.01634940 7746 -6.2143982 5.418688e-10
5 a - c english -0.08837323 0.01684334 7746 -5.2467759 1.588849e-07
6 b - c english 0.01322845 0.01653453 7746 0.8000496 4.237066e-01
contrast estimate SE df t.ratio p.value
1 spanish a - english a 0.20992980 0.01746107 7746 12.0227322 5.302321e-33
2 spanish a - spanish b -0.12208024 0.01719808 7746 -7.0984794 1.373008e-12
3 spanish a - english b 0.10832812 0.01716338 7746 6.3115829 2.913980e-10
4 spanish a - spanish c -0.22164497 0.01755089 7746 -12.6287005 3.330301e-36
5 spanish a - english c 0.12155657 0.01763454 7746 6.8930955 5.887823e-12
6 english a - spanish b -0.33201004 0.01638582 7746 -20.2620299 5.468649e-89
7 english a - english b -0.10160168 0.01634940 7746 -6.2143982 5.418688e-10
8 english a - spanish c -0.43157477 0.01675574 7746 -25.7568246 1.939396e-140
9 english a - english c -0.08837323 0.01684334 7746 -5.2467759 1.588849e-07
10 spanish b - english b 0.23040836 0.01606823 7746 14.3393775 4.808368e-46
11 spanish b - spanish c -0.09956473 0.01648151 7746 -6.0409975 1.602418e-09
12 spanish b - english c 0.24363681 0.01657055 7746 14.7030002 2.748620e-48
13 english b - spanish c -0.32997309 0.01644529 7746 -20.0648958 2.410363e-87
14 english b - english c 0.01322845 0.01653453 7746 0.8000496 4.237066e-01
15 spanish c - english c 0.34320154 0.01693644 7746 20.2640927 5.255320e-89
1.5.4 Visual interpretation
Let’s visualize the nature of the interaction of the two predictors based on the predictions from effects:
plot(lasi, # plot the effect of LANGUAGE:SITE in m_02
xlab="Site", # w/ this x-axis label
ylab="RT in ms (logged)", # w/ this y-axis label
ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
grid=TRUE) # & w/ a grid
plot(lasi, x.var="LANGUAGE", # plot the effect of LANGUAGE:SITE in m_02
xlab="Language", # w/ this x-axis label
ylab="RT in ms (logged)", # w/ this y-axis label
ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
grid=TRUE) # & w/ a grid
Of the standard effects plots, I think this one is best:
plot(lasi, # plot the effect of LANGUAGE:SITE in m_02
xlab="Site", # w/ this x-axis label
ylab="RT in ms (logged)", # w/ this y-axis label
ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
multiline=TRUE, # all data into one panel
confint=list(style="auto"), # but still confidence intervals
grid=TRUE) # & w/ a grid
But this is how I would plot this:
1.5.5 Write-up
To determine whether the reaction time to a word (in ms) varies as a function of the language in which the stimulus word was presented and the site at which the experiment was conducted, a linear model was fit with RT_log as the response variable and LANGUAGE as well as SITE and their interaction as predictors. The model was highly significant (F=192.1, df1=5, df2=7746, p<0.0001) but explains only little of the response variable (mult. R2=0.11, adj. R2=0.1098). The coefficients and predictions indicate that
- at each site, the (logged) RTs to Spanish words were longer than those to English words:
  - by a difference of 0.2 in site a;
  - by a difference of 0.23 in site b;
  - by a difference of 0.34 in site c;
- all combinations of SITE and LANGUAGE but one are significantly different from each other; only the RTs to English words do not differ significantly between sites b and c. [Show a plot]
| | Estimate | 95%-CI | se | t | p2-tailed |
|---|---|---|---|---|---|
| Intercept (LANG=span, SITE=a) | 9.305 | [9.28, 9.331] | 0.013 | 722.06 | <0.001 |
| LANGspan→eng | -0.21 | [-0.244, -0.176] | 0.017 | -12.02 | <0.001 |
| SITEa→b | 0.122 | [0.088, 0.156] | 0.017 | 7.1 | <0.001 |
| SITEa→c | 0.222 | [0.187, 0.256] | 0.018 | 12.63 | <0.001 |
| LANGspan→eng, SITEa→b | -0.02 | [-0.067, 0.026] | 0.024 | -0.86 | 0.388 |
| LANGspan→eng, SITEa→c | -0.133 | [-0.181, -0.086] | 0.024 | -5.48 | <0.001 |
Some housekeeping:
'data.frame': 7752 obs. of 10 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ RT_log : num 9.55 8.67 8.92 10.31 9.31 ...
$ GROUP : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
$ SITE : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
$ LENGTH : int 5 5 6 4 6 6 8 4 5 6 ...
$ ZIPFFREQ: num 4.28 4.89 5.37 5.12 4.15 ...
$ LANGUAGE: Factor w/ 2 levels "spanish","english": 1 1 1 1 1 1 1 1 1 1 ...
$ RT : int 748 408 483 1266 636 395 765 1297 849 315 ...
$ FREQPMW : int 19 78 233 133 14 67 34 114 2 20 ...
$ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
1.6 A numeric and a categorical predictor
Now we turn to a scenario where we are asking, does the logged reaction time to a word (in ms) vary as a function of
- the language that the stimulus word was presented in: LANGUAGE: spanish vs. english;
- the length of the word?
First, we explore a bit again:
spanish english
3775 3977
boxplot(main="RT in ms (logged) ~ LANGUAGE", varwidth=TRUE,
d$RT_log ~ d$LANGUAGE, notch=TRUE,
xlab="Language",
ylab="RT in ms (logged)", ylim=c(8, 12)); grid()
par(mfrow=c(1,2))
hist(d$LENGTH) # log2 or sqrt doesn't help much
plot(pch=16, col="#00000020",
x=jitter(d$LENGTH),
y=d$RT_log)
abline(v=quantiles(d$LENGTH), lty=3, col="#FF000040")
abline(h=quantiles(d$RT_log), lty=3, col="#FF000040")
par(mfrow=c(1,1))
1.6.1 Modeling & numerical interpretation
Let’s fit a model:
summary(m_01 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
LANGUAGE + LENGTH, # & the predictors
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + LANGUAGE + LENGTH, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.16687 -0.29052 -0.06813 0.21366 2.56698
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.235541 0.026947 342.736 < 2e-16 ***
LANGUAGEenglish -0.249161 0.010057 -24.774 < 2e-16 ***
LENGTH 0.034901 0.004771 7.315 2.84e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4303 on 7749 degrees of freedom
Multiple R-squared: 0.09315, Adjusted R-squared: 0.09291
F-statistic: 398 on 2 and 7749 DF, p-value: < 2.2e-16
Estimate 2.5 % 97.5 %
(Intercept) 9.23554072 9.18271822 9.28836322
LANGUAGEenglish -0.24916086 -0.26887582 -0.22944590
LENGTH 0.03490076 0.02554762 0.04425391
As before: are all 2 predictors significant?
Df Sum.of.Sq RSS AIC F.value Pr..F.
<none> NA NA 1434.600 -13072.13 NA NA
LANGUAGE 1 113.627901 1548.228 -12483.23 613.76174 1.828443e-130
LENGTH 1 9.905381 1444.505 -13020.79 53.50397 2.838880e-13
So what do the coefficients mean? To see that, we return to the data and the model matrix:
RT_log LANGUAGE LENGTH
35 9.228819 spanish 3
4 10.306062 spanish 4
178 9.233620 english 3
96 8.882643 english 4
(Intercept) LANGUAGEenglish LENGTH
35 1 0 3
4 1 0 4
178 1 1 3
96 1 1 4
The model matrix shows that we basically now have a combination of what we saw for each predictor above and we can see that
- the intercept is the prediction for when
  - LANGUAGE is spanish and
  - LENGTH is 0 (because then the LENGTH values are multiplied by 0 and don’t change the result);
- coefficient 2 is how we get from the intercept to the prediction for when
  - LANGUAGE changes from spanish to english, but
  - LENGTH isn’t mentioned, it remains the intercept value 0;
- coefficient 3 is how we get from the intercept to the prediction for when
  - LANGUAGE isn’t mentioned, it remains the intercept value spanish, but
  - LENGTH changes from 0 to 1.
The coefficients are therefore used like above:
"*"( # prediction for case 178 (LANGUAGE: english, LENGTH: 3):
coef(m_01),
model.matrix(m_01)[178,]) %>% sum
[1] 9.091082
"*"( # prediction for case 96 (LANGUAGE: english, LENGTH: 4):
coef(m_01),
model.matrix(m_01)[96,]) %>% sum
[1] 9.125983
Thus, this is what this model predicts and we again add it to the data frame d:
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
35 36 9.228819 heritage b 3 4.000000 spanish 600 10 correct
4 4 10.306062 heritage b 4 5.123852 spanish 1266 133 correct
178 192 9.233620 heritage b 3 4.653213 english 602 45 correct
96 105 8.882643 heritage a 4 5.252853 english 472 179 correct
PRED
35 9.340243
4 9.375144
178 9.091082
96 9.125983
How good is this model? Same as always:
[1] 1434.6
[1] 0.09314538
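(A sketch of where these two numbers can come from: for a linear model, deviance returns the residual sum of squares, and the R-squared is part of the model summary.)
deviance(m_01) # the deviance/sum of squared residuals: 1434.6
summary(m_01)$r.squared # multiple R-squared: 0.09314538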
Let’s compute emmeans and effects predictions again; we begin with the emmeans results for LANGUAGE:
LANGUAGE emmean SE df lower.CL upper.CL
1 spanish 9.417027 0.007106631 7749 9.403096 9.430958
2 english 9.167867 0.006918707 7749 9.154304 9.181429
What is that? From what we said above, we can infer that emmeans gives us predictions for each level of LANGUAGE while LENGTH is held constant at the mean (and we will confirm that in a second using effects).
What about the emmeans results for LENGTH, what is that? Here, we get the prediction when LENGTH is held at its mean:
LENGTH emmean SE df lower.CL upper.CL
1 5.200077 9.292447 0.004888682 7749 9.282864 9.30203
However, since LENGTH is a numeric predictor, we do kinda want to see its slope. Here’s a way we can get this: the following code returns the slope of LENGTH for each level of LANGUAGE in the column called LENGTH.trend, and they are the same because right now we are not letting the slope of LENGTH vary by level of LANGUAGE (because we have no interaction LANGUAGE:LENGTH in the model):
LANGUAGE LENGTH LENGTH.trend SE df
1 spanish 5.200077 0.03490076 0.004771357 7749
2 english 5.200077 0.03490076 0.004771357 7749
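(The hidden chunk presumably used emmeans’s emtrends function, roughly like this sketch:)
emtrends(m_01, # from m_01, get
   ~ LANGUAGE, # for each level of LANGUAGE
   var="LENGTH") # the slope of LENGTH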
What does effects return for LANGUAGE? The same as emmeans: the predictions for the two levels of LANGUAGE while LENGTH is at its mean:
(lan_d <- data.frame( # make lan_d a data frame of
lan <- predictorEffect( # the effect
"LANGUAGE", # of LANGUAGE
m_01))) # in m_01
LANGUAGE fit se lower upper
1 spanish 9.417027 0.007106631 9.403096 9.430958
2 english 9.167867 0.006918707 9.154304 9.181429
What does effects return for LENGTH (when we again ask for predictions for relevant values)?
(relevants_l <- c(
0:1, # these help us understand the slope
c(3,5,7,9), # these cover the range of the actual values
mean(d$LENGTH)) %>% sort) # the central tendency
[1] 0.000000 1.000000 3.000000 5.000000 5.200077 7.000000 9.000000
(len_d <- data.frame( # make len_d a data frame of
len <- effect( # the effect
"LENGTH", # of LENGTH
m_01, # in m_01 for the following values
xlevels=list(LENGTH=relevants_l))))
LENGTH fit se lower upper
1 0.000000 9.107714 0.025288119 9.058142 9.157286
2 1.000000 9.142615 0.020627323 9.102180 9.183050
3 3.000000 9.212416 0.011579141 9.189718 9.235115
4 5.000000 9.282218 0.004979295 9.272457 9.291979
5 5.200077 9.289201 0.004886925 9.279621 9.298780
6 7.000000 9.352019 0.009881146 9.332650 9.371389
7 9.000000 9.421821 0.018777847 9.385011 9.458631
Now it’s not the same as emmeans – 9.292447 (from emmeans) is not the same as 9.289201 (from effects) – because, remember, effects gives you predictions that involve weighted averages, with the weights being based on the frequencies of the levels of the so-called non-focal predictors:
# emmeans:
c(
coef(m_01)[1] + mean(d$LENGTH) * coef(m_01)[3],
coef(m_01)[1] + mean(d$LENGTH) * coef(m_01)[3] + coef(m_01)[2]) %>% mean
[1] 9.292447
# effects:
c(
coef(m_01)[1] + mean(d$LENGTH) * coef(m_01)[3],
coef(m_01)[1] + mean(d$LENGTH) * coef(m_01)[3] + coef(m_01)[2]) %>% weighted.mean(table(d$LANGUAGE))
[1] 9.289201
1.6.2 Visual interpretation
Let’s visualize each predictor’s effect:
1.6.3 Modeling with interactions
As before, the critical thing to realize here is that the effect of LENGTH is the same everywhere, i.e. regardless of whether LANGUAGE is spanish or english. For the same obvious reasons, this is again additive behavior. But it might again not be appropriate or accurate: maybe the effect of LENGTH is not the same regardless of the value of LANGUAGE. If that’s the case, this is again an interaction: something doesn’t do the same everywhere/always.
How do we check for this in R? Like this:
summary(m_02 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
LANGUAGE + LENGTH + # & the predictors
LANGUAGE:LENGTH, # & their interaction <----------------------------------------------
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + LANGUAGE + LENGTH + LANGUAGE:LENGTH,
data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.16406 -0.29017 -0.06899 0.21528 2.56360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.201792 0.032005 287.515 < 2e-16 ***
LANGUAGEenglish -0.145808 0.053852 -2.708 0.00679 **
LENGTH 0.041089 0.005726 7.175 7.88e-13 ***
LANGUAGEenglish:LENGTH -0.020223 0.010352 -1.954 0.05079 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4302 on 7748 degrees of freedom
Multiple R-squared: 0.09359, Adjusted R-squared: 0.09324
F-statistic: 266.7 on 3 and 7748 DF, p-value: < 2.2e-16
Is the interaction significant? It is an interaction of two predictors with 1 df so we already see that it’s just about not (unless, that is, you had good reason to expect an interaction term of LANGUAGEenglish:LENGTH with a negative sign, in which case you could divide the p-value of 0.05079 by 2 and score a significant result).
Df Sum.of.Sq RSS AIC F.value Pr..F.
<none> NA NA 1433.894 -13073.94 NA NA
LANGUAGE:LENGTH 1 0.7062722 1434.600 -13072.13 3.81632 0.05079154
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 7749 1434.600 NA NA NA NA
2 7748 1433.894 1 0.7062722 3.81632 0.05079154
Just about not, so m_01 is actually the model to go with. For didactic reasons, we will still look at m_02 though.
First, the coefficients and the model matrix:
RT_log LANGUAGE LENGTH
35 9.228819 spanish 3
4 10.306062 spanish 4
178 9.233620 english 3
96 8.882643 english 4
(Intercept) LANGUAGEenglish LENGTH LANGUAGEenglish:LENGTH
35 1 0 3 0
4 1 0 4 0
178 1 1 3 3
96 1 1 4 4
The model matrix shows that we basically now have a combination of what we saw for each predictor above and we can see that
- the intercept is the prediction for when
  - LANGUAGE is spanish and
  - LENGTH is 0 (because then the LENGTH values are multiplied by 0 and don’t change the result);
- coefficient 2 is how we get from the intercept to the prediction for when
  - LANGUAGE changes from spanish to english, but
  - LENGTH isn’t mentioned, it remains the intercept value 0;
- coefficient 3 is how we get from the intercept to the prediction for when
  - LANGUAGE isn’t mentioned, it remains the intercept value spanish, but
  - LENGTH changes from 0 to 1.
But now we do not just have additive effects, which means the predictions for LANGUAGE: english and LENGTH: 1 are different in how they require an additional addition, the interaction term: coefficient 4 is what you add to the intercept, the main effect of LANGUAGE: spanish→english, and the main effect of LENGTH: 0→1 (see cases 178 and 96). The coefficients are therefore used like above:
"*"( # prediction for case 178 (LANGUAGE: english, LENGTH: 3):
coef(m_01),
model.matrix(m_01)[178,]) %>% sum
[1] 9.091082
"*"( # prediction for case 96 (LANGUAGE: english, LENGTH: 4):
coef(m_01),
model.matrix(m_01)[96,]) %>% sum
[1] 9.125983
Thus, this is what this model predicts and we again add all predictions to the data frame d:
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
35 36 9.228819 heritage b 3 4.000000 spanish 600 10 correct
4 4 10.306062 heritage b 4 5.123852 spanish 1266 133 correct
178 192 9.233620 heritage b 3 4.653213 english 602 45 correct
96 105 8.882643 heritage a 4 5.252853 english 472 179 correct
PRED
35 9.325060
4 9.366149
178 9.118583
96 9.139449
Let’s compute emmeans and effects predictions again. What does emmeans return here for LANGUAGE, when we consider that LENGTH is also in the model?
LANGUAGE LENGTH emmean SE df lower.CL upper.CL
1 spanish 5.200077 9.415459 0.007150550 7748 9.401442 9.429476
2 english 5.200077 9.164490 0.007130065 7748 9.150514 9.178467
These are predictions for each level of LANGUAGE while LENGTH is held constant at the mean.
For LENGTH, we would want to know the slopes of LENGTH for each level of LANGUAGE so we use emtrends again:
LANGUAGE LENGTH.trend SE df lower.CL upper.CL
1 spanish 0.04108912 0.005726456 7748 0.029863715 0.05231452
2 english 0.02086630 0.008623755 7748 0.003961409 0.03777119
Now that we have the interaction in the model, we do get a different slope of LENGTH for each level of LANGUAGE.
What does effects return? Predictions that are perfectly compatible (check rows 9-10 of the output of the first bit of code and rows 5 and 12 of the output of the second bit of code):
(lale_d <- data.frame( # make lale_d a data frame of
lale <- effect( # the effect (or predictorEffect)
"LANGUAGE:LENGTH", # of LANGUAGE:LENGTH
m_02, # in m_02 for the following values
xlevels=list(LENGTH=relevants_l))))
LANGUAGE LENGTH fit se lower upper
1 spanish 0.000000 9.201792 0.032004567 9.139055 9.264530
2 english 0.000000 9.055984 0.043310253 8.971084 9.140884
3 spanish 1.000000 9.242882 0.026446520 9.191039 9.294724
4 english 1.000000 9.076850 0.034820641 9.008592 9.145108
5 spanish 3.000000 9.325060 0.015697909 9.294288 9.355832
6 english 3.000000 9.118583 0.018223335 9.082860 9.154306
7 spanish 5.000000 9.407238 0.007467839 9.392599 9.421877
8 english 5.000000 9.160316 0.006830522 9.146926 9.173705
9 spanish 5.200077 9.415459 0.007150550 9.401442 9.429476
10 english 5.200077 9.164490 0.007130065 9.150514 9.178467
11 spanish 7.000000 9.489416 0.011289432 9.467286 9.511547
12 english 7.000000 9.202048 0.018872607 9.165053 9.239044
13 spanish 9.000000 9.571595 0.021481906 9.529484 9.613705
14 english 9.000000 9.243781 0.035505602 9.174180 9.313381
1.6.4 Visual interpretation
Here’s an effects plot of the interaction:
plot(lale, # plot the effect of LANGUAGE:LENGTH in m_02
xlab="Length", # w/ this x-axis label
ylab="RT in ms (logged)", # w/ this y-axis label
ylim=range(d$RT_log, na.rm=TRUE), # w/ these y-axis limits
multiline=TRUE, # all data into one panel
confint=list(style="auto"), # but still confidence intervals
grid=TRUE) # & w/ a grid
This is how I would plot this:
1.6.5 Write-up
[I will now discuss m_02 because it has the interaction – in a real study you would discuss m_01 because the interaction was ns!]
To determine whether the reaction time to a word (in ms) varies as a function of the language in which the stimulus word was presented and the length of the word, a linear model was fit with RT_log as the response variable and LANGUAGE as well as LENGTH and their interaction as predictors. The model was highly significant (F=266.7, df1=3, df2=7748, p<0.0001) but explains only little of the response variable (mult. R2=0.094, adj. R2=0.093). The coefficients and predictions indicate that the (positive) slope of LENGTH for Spanish words is significantly higher than the (still positive) one for English words. [Show a plot]
| | Estimate | 95%-CI | se | t | p2-tailed |
|---|---|---|---|---|---|
| Intercept (LANG=span, LENGTH=0) | 9.202 | [9.139, 9.265] | 0.032 | 287.52 | <0.001 |
| LANGspan→eng | -0.146 | [-0.251, -0.04] | 0.054 | -2.71 | 0.007 |
| LENGTH0→1 | 0.041 | [0.03, 0.052] | 0.006 | 7.18 | <0.001 |
| LANGspan→eng, LENGTH0→1 | -0.02 | [-0.041, 0] | 0.011 | -1.95 | 0.051 |
Quiz question: We ‘know’ that the slope of LENGTH for LANGUAGE: english is significantly less than the slope of LENGTH for LANGUAGE: spanish (from the summary output). But is the slope of LENGTH for LANGUAGE: english significantly different from 0? How do we find that out?
# replace this with your answers/code
1.7 Two numeric predictors
Last one: we are asking, does the logged reaction time to a word (in ms) vary as a function of
- the (Zipf) frequency of the word;
- the length of the word?
We’ve looked at the predictors before so we’ll keep this short here:
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
plot(pch=16, col="#00000020",
x=d$ZIPFFREQ,
y=d$RT_log)
abline(v=quantiles(d$ZIPFFREQ), lty=2, col="#0000FF40")
abline(h=quantiles(d$RT_log) , lty=2, col="#0000FF40")
plot(pch=16, col="#00000020",
x=jitter(d$LENGTH),
y=d$RT_log)
abline(v=quantiles(d$LENGTH), lty=2, col="#0000FF40")
abline(h=quantiles(d$RT_log), lty=2, col="#0000FF40")
par(mfrow=c(1, 1)) # reset to default
1.7.1 Modeling & numerical interpretation
Let’s fit a model:
summary(m_01 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
ZIPFFREQ + LENGTH, # & the predictors
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + ZIPFFREQ + LENGTH, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.24899 -0.30279 -0.07074 0.22393 2.70915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.496032 0.051630 183.93 <2e-16 ***
ZIPFFREQ -0.109251 0.009231 -11.84 <2e-16 ***
LENGTH 0.057068 0.004799 11.89 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.443 on 7749 degrees of freedom
Multiple R-squared: 0.03869, Adjusted R-squared: 0.03845
F-statistic: 156 on 2 and 7749 DF, p-value: < 2.2e-16
Estimate 2.5 % 97.5 %
(Intercept) 9.49603232 9.39482453 9.59724010
ZIPFFREQ -0.10925086 -0.12734680 -0.09115492
LENGTH 0.05706792 0.04766146 0.06647439
As before: are all 2 predictors significant?
Df Sum.of.Sq RSS AIC F.value Pr..F.
<none> NA NA 1520.741 -12620.10 NA NA
ZIPFFREQ 1 27.48711 1548.228 -12483.23 140.0617 4.866190e-32
LENGTH 1 27.75700 1548.498 -12481.88 141.4370 2.465044e-32
So what do the coefficients mean? To see that, we return to the data and the model matrix:
(Intercept) ZIPFFREQ LENGTH
1 1 4.278754 5
2 1 4.892095 5
3 1 5.367356 6
4 1 5.123852 4
5 1 4.146128 6
6 1 4.826075 6
The model matrix shows that we basically now have a combination of what we saw for each predictor above and we can see that
- the intercept is the prediction for when both ZIPFFREQ is 0 and LENGTH is 0 (because then the multiplications would result in zeros to be added to the intercept);
- coefficient 2 is how we get from the intercept to the prediction for when
  - ZIPFFREQ changes from 0 to 1, but
  - LENGTH isn’t mentioned, it remains the intercept value 0;
- coefficient 3 is how we get from the intercept to the prediction for when
  - ZIPFFREQ isn’t mentioned, it remains the intercept value 0, but
  - LENGTH changes from 0 to 1.
The coefficients are therefore used like above: for the predictions, we multiply the coefficients with each row of the model matrix and sum:
apply( # apply to the head of the model matrix:
head(model.matrix(m_01)),
1, # row by row the multiplication & summation:
\(af) sum(af*coef(m_01)))
1 2 3 4 5 6
9.313914 9.246906 9.252052 9.164519 9.385472 9.311187
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
1 1 9.546894 heritage b 5 4.278754 spanish 748 19 correct
2 2 8.672425 heritage b 5 4.892095 spanish 408 78 correct
3 3 8.915879 heritage b 6 5.367356 spanish 483 233 correct
4 4 10.306062 heritage b 4 5.123852 spanish 1266 133 correct
5 5 9.312883 heritage b 6 4.146128 spanish 636 14 correct
6 6 8.625709 heritage b 6 4.826075 spanish 395 67 correct
PRED
1 9.313914
2 9.246906
3 9.252052
4 9.164519
5 9.385472
6 9.311187
We see from the R-squared that the model is not great (this time, I am not computing it from the deviance again).
Let’s compute emmeans and effects predictions again. What does emmeans return? The slope of each predictor while the other is at its mean:
LENGTH ZIPFFREQ.trend SE df lower.CL upper.CL
1 5.200077 -0.1092509 0.009231348 7749 -0.1273468 -0.09115492
ZIPFFREQ LENGTH.trend SE df lower.CL upper.CL
1 4.609476 0.05706792 0.004798556 7749 0.04766146 0.06647439
What does effects return? Predictions that are perfectly compatible (check rows 1-2 of both outputs):
(zip_d <- data.frame( # make zip_d a data frame of
zip <- effect( # the effect
"ZIPFFREQ", # of ZIPFFREQ
m_01, # in m_01 for the following values
xlevels=list(ZIPFFREQ=relevants_z))))
ZIPFFREQ fit se lower upper
1 0.000000 9.792790 0.042848122 9.708796 9.876784
2 1.000000 9.683539 0.033698080 9.617482 9.749596
3 3.000000 9.465037 0.015686472 9.434288 9.495787
4 4.000000 9.355786 0.007547924 9.340991 9.370582
5 4.609476 9.289201 0.005031505 9.279338 9.299064
6 5.000000 9.246536 0.006189709 9.234402 9.258669
7 6.000000 9.137285 0.013787292 9.110258 9.164312
(len_d <- data.frame( # make len_d a data frame of
len <- effect( # the effect
"LENGTH", # of LENGTH
m_01, # in m_01 for the following values
xlevels=list(LENGTH=relevants_l))))
LENGTH fit se lower upper
1 0.000000 8.992443 0.025455085 8.942544 9.042342
2 1.000000 9.049511 0.020772869 9.008791 9.090231
3 3.000000 9.163647 0.011694887 9.140722 9.186572
4 5.000000 9.277783 0.005122285 9.267742 9.287824
5 5.200077 9.289201 0.005031505 9.279338 9.299064
6 7.000000 9.391919 0.009995715 9.372324 9.411513
7 9.000000 9.506054 0.018915600 9.468975 9.543134
1.7.2 Visual interpretation
Let’s visualize each predictor’s effect:
1.7.3 Modeling with interactions
As before, the critical thing to realize here is that the effect of LENGTH
is the same everywhere, i.e. regardless of what ZIPFFREQ
is, and of course the other way round. But it might again not be appropriate or accurate: Maybe the effect of LENGTH
is not the same regardless of the value of ZIPFFREQ
(and vice versa). If that’s the case, this is again an interaction: something doesn’t do the same everywhere/always.
How do we check for this in R? By now we know, same way as always:
summary(m_02 <- lm( # make/summarize the linear model m_01:
RT_log ~ 1 + # RT_log ~ an overall intercept (1)
ZIPFFREQ + LENGTH + # & the predictors
ZIPFFREQ:LENGTH, # & their interaction <----------------------------------------------
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
lm(formula = RT_log ~ 1 + ZIPFFREQ + LENGTH + ZIPFFREQ:LENGTH,
data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.24632 -0.30144 -0.07187 0.22311 2.71514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.120698 0.199441 45.731 < 2e-16 ***
ZIPFFREQ -0.026128 0.043651 -0.599 0.549484
LENGTH 0.129044 0.037253 3.464 0.000535 ***
ZIPFFREQ:LENGTH -0.015979 0.008201 -1.948 0.051413 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4429 on 7748 degrees of freedom
Multiple R-squared: 0.03916, Adjusted R-squared: 0.03879
F-statistic: 105.3 on 3 and 7748 DF, p-value: < 2.2e-16
Is the interaction significant? It is an interaction of two predictors with 1 df so we already see that it’s not.
Df Sum.of.Sq RSS AIC F.value Pr..F.
<none> NA NA 1519.996 -12621.89 NA NA
ZIPFFREQ:LENGTH 1 0.744684 1520.741 -12620.10 3.795938 0.05141312
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 7749 1520.741 NA NA NA NA
2 7748 1519.996 1 0.744684 3.795938 0.05141312
Just about not, so m_01 is actually the model to go with. For didactic reasons, we will again look at m_02 though. So what do the coefficients mean? To see that, we return to the data and the model matrix:
(Intercept) ZIPFFREQ LENGTH ZIPFFREQ:LENGTH
1 1 4.278754 5 21.39377
2 1 4.892095 5 24.46047
3 1 5.367356 6 32.20414
4 1 5.123852 4 20.49541
5 1 4.146128 6 24.87677
6 1 4.826075 6 28.95645
The model matrix shows that we basically now have a combination of what we saw for each predictor above and we can see that
- the intercept is the prediction for when both ZIPFFREQ is 0 and LENGTH is 0 (because then the multiplications would result in zeros to be added to the intercept);
- coefficient 2 is how we get from the intercept to the prediction for when
  - ZIPFFREQ changes from 0 to 1, but
  - LENGTH isn’t mentioned, it remains the intercept value 0;
- coefficient 3 is how we get from the intercept to the prediction for when
  - ZIPFFREQ isn’t mentioned, it remains the intercept value 0, but
  - LENGTH changes from 0 to 1.
But now we do not just have additive effects, which means the prediction for ZIPFFREQ: 1 and LENGTH: 1 is different in how it requires an additional addition, the interaction term: coefficient 4 is what you add to the intercept, the main effect of ZIPFFREQ: 0→1, and the main effect of LENGTH: 0→1.
apply( # apply to the head of the model matrix:
head(model.matrix(m_02)),
1, # row by row the multiplication & summation:
\(af) sum(af*coef(m_02)))
1 2 3 4 5 6
9.312283 9.247256 9.240150 9.175513 9.389139 9.306186
CASE RT_log GROUP SITE LENGTH ZIPFFREQ LANGUAGE RT FREQPMW CORRECT
1 1 9.546894 heritage b 5 4.278754 spanish 748 19 correct
2 2 8.672425 heritage b 5 4.892095 spanish 408 78 correct
3 3 8.915879 heritage b 6 5.367356 spanish 483 233 correct
4 4 10.306062 heritage b 4 5.123852 spanish 1266 133 correct
5 5 9.312883 heritage b 6 4.146128 spanish 636 14 correct
6 6 8.625709 heritage b 6 4.826075 spanish 395 67 correct
PRED
1 9.312283
2 9.247256
3 9.240150
4 9.175513
5 9.389139
6 9.306186
We see from the R-squared that the model is not great (again, I am not computing it from the deviance).
Let’s compute emmeans and effects predictions again. What does emmeans return here for each predictor when we consider that the corresponding other one is also in the model? The slope of each predictor while the other is at its mean:
LENGTH ZIPFFREQ.trend SE df lower.CL upper.CL
1 5.200077 -0.1092176 0.009229699 7748 -0.1273103 -0.09112488
ZIPFFREQ LENGTH.trend SE df lower.CL upper.CL
1 4.609476 0.05539128 0.004874258 7748 0.04583642 0.06494614
What does effects return? Predictions that are perfectly compatible (check rows 29 and 30 for ZIPFFREQ as well as rows 5 and 12 for LENGTH of this output):
(zile_d <- data.frame( # make zile_d a data frame of
zile <- effect( # the effect
"ZIPFFREQ:LENGTH", # of ZIPFFREQ:LENGTH
m_02, # in m_02
xlevels=list( # for these values of the predictors
ZIPFFREQ=relevants_z,
LENGTH=relevants_l))))
ZIPFFREQ LENGTH fit se lower upper
1 0.000000 0.000000 9.120698 0.199441378 8.729739 9.511657
2 1.000000 0.000000 9.094570 0.156249733 8.788279 9.400862
3 3.000000 0.000000 9.042315 0.071558871 8.902040 9.182590
4 4.000000 0.000000 9.016187 0.034499982 8.948558 9.083817
5 4.609476 0.000000 9.000263 0.025765045 8.949757 9.050770
6 5.000000 0.000000 8.990060 0.032720725 8.925918 9.054201
7 6.000000 0.000000 8.963932 0.069006254 8.828661 9.099203
8 0.000000 1.000000 9.249742 0.163253667 8.929721 9.569764
9 1.000000 1.000000 9.207636 0.127949737 8.956820 9.458452
10 3.000000 1.000000 9.123423 0.058712317 9.008331 9.238515
11 4.000000 1.000000 9.081317 0.028350693 9.025742 9.136892
12 4.609476 1.000000 9.055654 0.021007114 9.014475 9.096834
13 5.000000 1.000000 9.039211 0.026573248 8.987120 9.091302
14 6.000000 1.000000 8.997105 0.056162828 8.887010 9.107199
15 0.000000 3.000000 9.507831 0.093425243 9.324692 9.690969
16 1.000000 3.000000 9.433767 0.073348608 9.289984 9.577550
17 3.000000 3.000000 9.285640 0.033936620 9.219115 9.352165
18 4.000000 3.000000 9.211577 0.016481134 9.179269 9.243884
19 4.609476 3.000000 9.166437 0.011780144 9.143345 9.189529
20 5.000000 3.000000 9.137513 0.014638709 9.108818 9.166209
21 6.000000 3.000000 9.063450 0.031299940 9.002094 9.124806
22 0.000000 5.000000 9.765919 0.043670235 9.680314 9.851525
23 1.000000 5.000000 9.659898 0.034375324 9.592514 9.727283
24 3.000000 5.000000 9.447857 0.016075601 9.416345 9.479370
25 4.000000 5.000000 9.341837 0.007785475 9.326575 9.357098
26 4.609476 5.000000 9.277219 0.005129513 9.267164 9.287275
27 5.000000 5.000000 9.235816 0.006218844 9.223625 9.248007
28 6.000000 5.000000 9.129795 0.013878263 9.102590 9.157000
29 0.000000 5.200077 9.791738 0.042843797 9.707752 9.875723
30 1.000000 5.200077 9.682520 0.033696060 9.616467 9.748574
31 3.000000 5.200077 9.464085 0.015691256 9.433326 9.494844
32 4.000000 5.200077 9.354868 0.007561289 9.340045 9.369690
33 4.609476 5.200077 9.288302 0.005051700 9.278399 9.298205
34 5.000000 5.200077 9.245650 0.006205266 9.233486 9.257814
35 6.000000 5.200077 9.136432 0.013791746 9.109397 9.163468
36 0.000000 7.000000 10.024007 0.078665115 9.869803 10.178212
37 1.000000 7.000000 9.886030 0.061449727 9.765572 10.006488
38 3.000000 7.000000 9.610074 0.027700695 9.555773 9.664375
39 4.000000 7.000000 9.472096 0.013065961 9.446483 9.497709
40 4.609476 7.000000 9.388002 0.010194076 9.368019 9.407985
41 5.000000 7.000000 9.334118 0.013389705 9.307871 9.360366
42 6.000000 7.000000 9.196141 0.028160666 9.140938 9.251343
43 0.000000 9.000000 10.282096 0.146872763 9.994185 10.570006
44 1.000000 9.000000 10.112161 0.114731149 9.887257 10.337065
45 3.000000 9.000000 9.772291 0.051753307 9.670840 9.873741
46 4.000000 9.000000 9.602356 0.024534566 9.554262 9.650450
47 4.609476 9.000000 9.498785 0.019276764 9.460997 9.536572
48 5.000000 9.000000 9.432421 0.025221465 9.382980 9.481862
49 6.000000 9.000000 9.262486 0.052734590 9.159112 9.365860
1.7.4 Visual interpretation
Let’s visualize each predictor’s effect:
plot(zile, # plot the effect of ZIPFFREQ:LENGTH in m_02
xlab="Zipf frequency", # w/ this x-axis label
ylab="RT in ms (logged)", # w/ this y-axis label
ylim=c(8, 12), # w/ these y-axis limits
grid=TRUE) # & w/ a grid
Not great. Two options, but for both we first increase the resolution in zile_d:
zile_d <- data.frame( # make zile_d a data frame of
zile <- effect( # the effect
"ZIPFFREQ:LENGTH", # of ZIPFFREQ:LENGTH
m_02, # in m_02
xlevels=list( # for these values of the predictors
ZIPFFREQ=seq(min(d$ZIPFFREQ), max(d$ZIPFFREQ), length.out=15), # <------------------
LENGTH =seq(min(d$LENGTH) , max(d$LENGTH) , length.out=15)))) # <------------------
Option 1 is a 3-D plot for websites and your own reports:
library(plotly)
fig <- plot_ly(zile_d,
x=~ZIPFFREQ, y=~LENGTH, z=~fit, color=~fit, colors=c("#FFFFFF20", "#00000020")) %>%
add_markers %>%
layout(scene=list(
xaxis=list(title = "Zipf frequency"),
yaxis=list(title = "Length"),
zaxis=list(title = "Predicted RT")))
fig
Option 2 is a numeric heatmap for publications:
zile_d$fitcat <- as.numeric( # create a numeric column fitcat in zile_d
cut(zile_d$fit, # by cutting the predicted values up
breaks=seq( # at the values ranging from
min(zile_d$fit), # the smallest pred RT to <-------------------------------------
max(zile_d$fit), # the largest pred RT <-------------------------------------
length.out=10), # into 9 groups
include.lowest=TRUE, # include the lowest, and
labels=1:9)) # use the numbers 1:9 as labels
plot(type="n", # plot nothing, just
xlab="Zipf frequency", x=zile_d$ZIPFFREQ, # get the coordinate
ylab="LENGTH" , y=zile_d$LENGTH, ); grid() # system set up
# add the predictions
text(x=zile_d$ZIPFFREQ, y=zile_d$LENGTH, # plot text at these coordinates, namely
labels=zile_d$fitcat, # the 'categorical/ordinal' predictions
# make the physical size of the numbers dependent on the values
cex=0.5+zile_d$fitcat/7,
# make the darkness of the grey reflect the predicted RT
col=grey(level=zile_d$fitcat/10)) # see ?grey
But why is this specific one potentially misleading? Because this procedure can exaggerate the size of the effects that the two predictors have, namely when the predicted RTs in zile_d$fit cover only a small part of the range of the actually observed RTs. Then our binning of the predictions into 9 bins makes the lowest and the highest predicted bin – 1 and 9 – seem widely apart when in fact they are fairly close to each other when considered against the background of the range of the actually observed RTs. To avoid something like that, we above always defined y-axis limits such that they cover the whole range of reaction times. The equivalent to that here would be to take care that we anchor the range of bin values by (i) the smallest RT value from both the observed and the predicted data and (ii) the largest RT value from both the observed and the predicted data. That will make maximally sure the plot is not misleading. Here’s how we can do this:
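A sketch of that anchoring, assuming the two names used in the next chunk (since the model and the plot are on the logged scale, we pool the observed RT_log values with the predicted fits):
min_of_obs_n_pred <- min( # smallest value of
   c(d$RT_log, zile_d$fit), na.rm=TRUE) # observed & predicted (logged) RTs
max_of_obs_n_pred <- max( # largest value of
   c(d$RT_log, zile_d$fit), na.rm=TRUE) # observed & predicted (logged) RTs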
And then we use cut to bin like before, just with the new breaks setting:
zile_d$fitcat <- as.numeric( # create a numeric column fitcat in zile_d
cut(zile_d$fit, # by cutting the predicted values up
breaks=seq( # at the values ranging from
min_of_obs_n_pred, # the smallest obs/pred RT to <---------------------------------
max_of_obs_n_pred, # the largest obs/pred RT <---------------------------------
length.out=10), # into 9 groups
include.lowest=TRUE, # include the lowest, and
labels=1:9)) # use the numbers 1:9 as labels
plot(type="n", # plot nothing, just
xlab="Zipf frequency", x=zile_d$ZIPFFREQ, # get the coordinate
ylab="Length" , y=zile_d$LENGTH, ); grid() # system set up
# add the predictions
text(x=zile_d$ZIPFFREQ, y=zile_d$LENGTH, # plot text at these coordinates, namely
labels=zile_d$fitcat, # the 'categorical/ordinal' predictions
# make the physical size of the numbers dependent on the values
cex=0.5+zile_d$fitcat/7,
# make the darkness of the grey reflect the predicted RT
col=grey(level=zile_d$fitcat/10)) # see ?grey
Much more realistic and much more sobering…
1.7.5 Write-up
[I will now discuss m_02 because it has the interaction – in a real study you would discuss m_01 because the interaction was ns!] To determine whether the reaction time to a word (in ms) varies as a function of the (Zipf) frequency of the word and the length of the word, a linear model was fit with RT_log as the response variable and ZIPFFREQ as well as LENGTH and their interaction as predictors. The model was highly significant (F=105.3, df1=3, df2=7748, p<0.0001) but explains only very little of the response variable (mult. R2=0.039, adj. R2=0.039). The coefficients and predictions indicate that LENGTH is positively correlated with (logged) RT whereas ZIPFFREQ is negatively correlated with (logged) RT. However, the two variables interact weakly such that
- when ZIPFFREQ is low, LENGTH has a stronger effect than when ZIPFFREQ is high;
- when LENGTH is small, ZIPFFREQ has a weaker effect than when LENGTH is high. [Show a plot]
| | Estimate | 95%-CI | se | t | p2-tailed |
|---|---|---|---|---|---|
| Intercept (ZIPF=0, LENGTH=0) | 9.121 | [8.73, 9.517] | 0.199 | 45.73 | <0.001 |
| ZIPF0→1 | -0.026 | [-0.112, 0.059] | 0.044 | -0.599 | 0.55 |
| LENGTH0→1 | 0.129 | [0.056, 0.202] | 0.037 | 3.464 | <0.001 |
| ZIPF0→1, LENGTH0→1 | -0.016 | [-0.032, 0] | 0.008 | -1.95 | 0.051 |
2 Binary logistic regression modeling
2.1 Introduction
rm(list=ls(all=TRUE)); invisible(gc())
set.seed(sum(utf8ToInt("Gummibärchen")))
invisible(sapply(c("car", "effects", "emmeans", "magrittr", "multcomp"),
library, character.only=TRUE, logical.return=TRUE,
quietly=TRUE, verbose=FALSE, warn.conflicts=TRUE))
invisible(sapply(dir("_helpers", full.names=TRUE), source))
Imagine you have a binary variable such as the choice of a genitive construction; the data are in _input/genitives.csv and you can find information about the variables/columns in _input/genitives.r, and our response variable is called GENITIVE:
summary(d <- read.delim( # summarize d, the result of loading
file="_input/genitives.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
CASE GENITIVE SPEAKER MODALITY POR_LENGTH
Min. : 2 of:2720 nns:2666 spoken :1685 Min. : 1.00
1st Qu.:1006 s : 880 ns : 934 written:1915 1st Qu.: 8.00
Median :2018 Median : 11.00
Mean :2012 Mean : 14.58
3rd Qu.:3017 3rd Qu.: 17.00
Max. :4040 Max. :204.00
PUM_LENGTH POR_ANIMACY POR_FINAL_SIB POR_DEF
Min. : 2.00 animate : 920 absent :2721 definite :2349
1st Qu.: 6.00 collective: 607 present: 879 indefinite:1251
Median : 9.00 inanimate :1671
Mean : 10.35 locative : 243
3rd Qu.: 13.00 temporal : 159
Max. :109.00
There are 3600 genitives in the data. Your job is to guess each one of them such that you guess correctly as many as possible. What would you do?
Let’s see how good each of these is, where we determine goodness on the basis of the number/proportion of correct guesses:
[1] 0.5005556 0.5166667 0.6366667 0.6388889
Here are the corresponding results for my guesses …
stgs_picks <- rep(d$GENITIVE %>% table %>% sort %>% tail(1) %>% names, nrow(d))
(stgs_accuracy <- mean(stgs_picks==d$GENITIVE))
[1] 0.7555556
… which is better than all other results above:
Why, what did I guess? What everyone should have guessed: a measure of central tendency, the statistic you would provide for a variable of interest, a response variable, if you are only allowed to provide a single value. How can we compute this here?
Thus, when I asked you to predict for every single case in the data which genitive would be used, you should have always guessed the most frequent level of the response, the mode, which leads to a baseline called no-information rate (because we are not using any information other than the actual response variable itself). If you always, invariably predict the most frequent level of the response, what is the baseline accuracy you would get? The result we already saw:
# baseline := highest proportion in the table of the response
(baseline <- d$GENITIVE %>% table %>% prop.table %>% max)
[1] 0.7555556
As for deviance, in the case of lm, we defined deviance as the sum of squared residuals and we said we want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. What is deviance here in the case of a glm? To understand that, we first compute a null model:
summary(m_00 <- glm( # make/summarize the gen. linear model m_00:
GENITIVE ~ 1, # GENITIVE ~ an overall intercept (1) & no predictor
family=binomial, # the response is binary
data=d, # those variables are in d
na.action=na.exclude)) # skip cases with NA/missing data (good habit)
Call:
glm(formula = GENITIVE ~ 1, family = binomial, data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.12847 0.03878 -29.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 4004.3 on 3599 degrees of freedom
AIC: 4006.3
Number of Fisher Scoring iterations: 4
In a way that will become clearer further below, this model allows us to compute the no-information rate / baseline from above …
… but also the deviance:
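(A sketch of both computations from m_00; note that residuals on a glm returns deviance residuals by default:)
plogis(coef(m_00)) # inverse logit of the intercept: the proportion of s-genitives
max(plogis(coef(m_00)), 1-plogis(coef(m_00))) # the baseline/no-information rate: 0.7555556
sum(residuals(m_00)^2) # the sum of the squared (deviance) residuals ...
deviance(m_00) # ... = the (null) deviance: 4004.3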
We see that the deviance of m_00 is still the sum of this model's squared residuals and the model's "null deviance" (and, if a model is a null model, also what is called "residual deviance"). But what are squared residuals here? We might discuss that in more detail later, if we have time. As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.
2.2 A binary predictor
Does the choice of a genitive construction vary as a function of the modality in which the construction was produced (spoken vs. written)?
Some exploration of the relevant variables:
of s
2720 880
spoken written
1685 1915
of s
spoken 1232 453
written 1488 427
How could one approach this ‘descriptively’? We could say what the odds are for s-genitives in the spoken data:
… and we could say what the odds are for s-genitives in the written data:
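(A sketch, reading the numbers off the cross-tabulation above:)
453 / 1232 # odds of s- (vs. of-) genitives in speaking: ~0.368
427 / 1488 # odds of s- (vs. of-) genitives in writing: ~0.287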
But in some sense, odds are annoying because they have this very different value range for ‘I like s-genitives’ vs. ‘I don’t like s-genitives’, here shown for the spoken part:
1675 / 10 # I like s-genitives a lot
## [1] 167.5
1000 / 685 # I like s-genitives somewhat
## [1] 1.459854
843 / 842 # I like s-genitives pretty much as much as of-genitives
## [1] 1.001188
685 / 1000 # I dislike s-genitives somewhat
## [1] 0.685
10 / 1675 # I don't like s-genitives at all
## [1] 0.005970149
But look what happens if we log the odds:
log(1675 / 10) # I like s-genitives a lot
## [1] 5.120983
log(1000 / 685) # I like s-genitives somewhat
## [1] 0.3783364
log( 843 / 842) # I like s-genitives pretty much as much as of-genitives
## [1] 0.001186944
log( 685 / 1000) # I dislike s-genitives somewhat
## [1] -0.3783364
log( 10 / 1675) # I don't like s-genitives at all
## [1] -5.120983
Much better, so let’s do that for our real data:
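(Again a sketch based on the cross-tabulation:)
log(453 / 1232) # logged odds of s-genitives in speaking: -1.000502
log(427 / 1488) # logged odds of s-genitives in writing: -1.248404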
Keep those numbers in mind. But most people would probably express this in probabilities:
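(A sketch:)
453 / (1232+453) # probability of s-genitives in speaking: 0.2688427
427 / (1488+427) # probability of s-genitives in writing: 0.2229765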
We’ll revisit this in a moment …
2.2.1 Modeling & numerical interpretation
How would we have dealt with this without a model? First, we would have computed a chi-squared test like this, where the chi-squared value is computed involving differences of observed and expected values:
Pearson's Chi-squared test
data: mod_x_gen
X-squared = 10.21, df = 1, p-value = 0.001397
Second, we might have computed the odds ratio and the logged odds ratio for the s-genitives:
[1] 0.7804363
[1] -0.2479022
How do we do this now? With a binary logistic regression, which returns the G2-statistic, which is computed based on ratios of observed and expected values (not differences like chi-squared) and which also returns the (logged) odds ratio. Let’s fit our regression model,
summary(m_01 <- glm( # make/summarize the gen. linear model m_01:
GENITIVE ~ 1 + # GENITIVE ~ an overall intercept (1)
MODALITY, # & the predictor MODALITY
family=binomial, # resp = binary
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + MODALITY, family = binomial, data = d,
na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00050 0.05495 -18.208 < 2e-16 ***
MODALITYwritten -0.24790 0.07767 -3.192 0.00141 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3994.1 on 3598 degrees of freedom
AIC: 3998.1
Number of Fisher Scoring iterations: 4
Estimate 2.5 % 97.5 %
(Intercept) -1.0005020 -1.1091198 -0.89367488
MODALITYwritten -0.2479022 -0.4002591 -0.09571937
Is the model significant? Of course: its only predictor is. But even if there were more than one predictor, we could compute the significance from the deviance reduction like this:
pchisq( # compute the area under the chi-squared curve
q=4004.3-3994.1, # for this chi-squared value: null - residual deviance
df=3599-3598, # at this df: null - residual df
lower.tail=FALSE) # only using the right/upper tail/side
[1] 0.001404407
We can also use drop1
again, just with a different test
argument:
drop1(m_01, # drop each droppable predictor at a time from m_01 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 3994.079 3998.079 NA NA
MODALITY 1 4004.273 4006.273 10.19365 0.001409249
(Note how similar the chi-squared value and the G2-statistic and their p-values are.)
But what are the coefficients?
Recognize the intercept?
(Intercept)
-1.000502
of s
spoken 1232 453
written 1488 427
[1] -1.000502
Recognize what the MODALITY coefficient is?
"+"( # add to
log(453 / 1232), # coef(m_01)[1], the intercept
coef(m_01)[2]) # coef(m_01)[2], the coef for MODALITY
MODALITYwritten
-1.248404
[1] -1.248404
Let’s make ‘predictions’, which we add to the data: first, we compute the predictions of the model. The function predict, used like here, does the same coef times model.matrix thing as always:
d$PRED_LO_s <- predict( # make d$PRED_LO_s the result of predicting
m_01) # log odds of s-genitives from m_01
picky <- c(1, 3600)
sum(coef(m_01) * model.matrix(m_01)[1,]) # prediction (in log odds) for case 1
[1] -1.000502
sum(coef(m_01) * model.matrix(m_01)[3600,]) # prediction (in log odds) for case 3600
[1] -1.248404
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
3600 4040 s ns written 7 6 animate
POR_FINAL_SIB POR_DEF PRED_LO_s
1 absent definite -1.000502
3600 absent definite -1.248404
Second, we compute predicted probabilities; note the type="response":
d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting
m_01, # from m_01
type="response") # predicted probabilities of "s"
d[c(1,3600),]
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
3600 4040 s ns written 7 6 animate
POR_FINAL_SIB POR_DEF PRED_LO_s PRED_PP_s
1 absent definite -1.000502 0.2688427
3600 absent definite -1.248404 0.2229765
Would we be able to get the predicted probabilities from the (logged) odds?
1 3600
0.2688427 0.2229765
1 3600
0.2688427 0.2229765
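(What the two computations presumably look like, as a sketch: the inverse logit done by hand and then with the built-in plogis:)
exp(d$PRED_LO_s[picky]) / (1+exp(d$PRED_LO_s[picky])) # by hand
plogis(d$PRED_LO_s[picky]) # same, w/ the inverse-logit function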
Then, we compute the categorical predictions: if the predicted probability of s-genitives is greater than that of of-genitives, we predict s-genitives – how?
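One way to do it (a sketch; since the response is binary, ‘greater than that of of-genitives’ amounts to a predicted probability of s greater than 0.5):
d$PRED_CAT <- factor( # make d$PRED_CAT the result of checking:
   ifelse(d$PRED_PP_s>0.5, # if the pred. prob. of s exceeds 0.5,
      "s", "of"), # predict s, otherwise of,
   levels=levels(d$GENITIVE)) # w/ the same levels as the response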
Let’s now evaluate the predictive capacity of the model. How well is the model doing?
(c_m <- table( # confusion matrix: cross-tabulate
"OBS" =d$GENITIVE, # observed genitives in the rows
"PREDS"=d$PRED_CAT)) # predicted orders in the columns
PREDS
OBS of s
of 2720 0
s 880 0
Obviously, very badly. We can now compute a variety of statistics although a confusion matrix with only one prediction is not the greatest vehicle to explain them – but that’s what we have, so … To define these statistics, it is useful to start with two easy distinctions:
- positives vs. negatives: positives are the cases where the model predicts the level it was tasked to predict, i.e. here s, negatives are when it predicts the other level. (This terminology comes from applications where the two levels of the response are massively different qualitatively and in terms of impact and often also in terms of frequency. For example, when you do cancer screening, the test is tasked with finding cancer so that would be the ‘positive’ prediction; when credit card companies monitor transactions for being likely fraudulent, then the algorithm saying ‘this use of your credit card is fraudulent’ is the ‘positive’ prediction – one way of putting it is that the positives are the predictions that are important and require (a re)action.)
- true vs. false is just whether the prediction of the model is correct or not.
That means we can define the four possible outcomes as shown here in Table 1 (observed vs. predicted GENITIVE):
 | predicted: GENITIVE: of | predicted: GENITIVE: s |
---|---|---|
observed: GENITIVE: of | 2720 (true negatives, tn) | 0 (false positives, fp) |
observed: GENITIVE: s | 880 (false negatives, fn) | 0 (true positives, tp) |
With these notions in place, we can now define some widely used metrics. First the simplest one, accuracy. Accuracy answers the following question: “Out of all the predictions the model made, how many (as a proportion) are correct?”. That means it is the sum of the main diagonal of c_m out of all predictions (the sum of c_m); this value is probably always reported although it’s far from ideal.
\[accuracy = \frac{tp+tn}{tp+tn+fp+fn}=\frac{0+2720}{0+2720+0+880}=0.7555556\]
A second useful metric is precision. Precision (also sometimes called positive predictive value) answers the following question: “Out of all the times the model predicted what it was supposed to predict (here, s), how often (as a proportion) was it right?”. Accordingly, it is computed as follows:
\[precision = \frac{tp}{tp+fp}=\frac{0}{0+0}=\mathrm{not~defined,~set~to~0}\]
The next one is recall. Recall (also sometimes called sensitivity or the true positive rate/TPR) answers the following question: “Out of all the instances the model was supposed to predict, how many (as a proportion) did it predict correctly?”. Therefore, this is how we get it:
\[recall/sensitivity/TPR = \frac{tp}{tp+fn}=\frac{0}{0+880}=0\]
That also means that the recall score for each level of the response variable is the accuracy for that level.
The last metric is specificity. Specificity (also sometimes called true negative rate/TNR) answers the question “Out of all the instances the model was not supposed to predict, how many (as a proportion) did it predict correctly?”. I hope you notice that means that specificity is ‘recall for the not predicted level of the response’ so this is how we get that:
\[specificity/TNR = \frac{tn}{tn+fp}=\frac{2720}{2720+0}=1\]
c( # evaluate the confusion matrix
"Prec. for s" =c_m[ "s", "s"] / sum(c_m[ , "s"]),
"Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s", ]),
"Prec. for of" =c_m["of","of"] / sum(c_m[ ,"of"]),
"Acc./rec. for of"=c_m["of","of"] / sum(c_m["of", ]),
"Acc. (overall)" =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
NaN 0.0000000 0.7555556 1.0000000
Acc. (overall)
0.7555556
McFadden R-squared Nagelkerke R-squared Cohen's kappa
0.002545694 0.004212717 0.000000000
C
0.530915775
Here, you also see Cohen’s kappa and the C-score. The former is a measure widely used to quantify interrater reliability that, here, helps contextualize an accuracy value against chance. Specifically, Cohen’s κ is computed like this:
\[\mathrm{Cohen's}~\kappa = \frac{accuracy - chance~accuracy}{1-chance~accuracy}\]
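For our confusion matrix, the chance accuracy follows from the marginal proportions of c_m, which is why κ is 0 here; a sketch (chance_acc is just our helper name):
chance_acc <- sum( # chance accuracy: sum over both levels of
rowSums(c_m)/sum(c_m) * # the observed proportions times
colSums(c_m)/sum(c_m)) # the predicted proportions
(mean(d$GENITIVE==d$PRED_CAT)-chance_acc) / (1-chance_acc) # Cohen's kappa: 0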
As for the C-score: following Baayen (2008:204) and Hosmer, Lemeshow, & Sturdivant (2013:177), C-scores of ≥0.8 are considered good or “excellent”, so we’re not really making the cut here – we’re not even reaching what Hosmer et al. call “acceptable” (for which C would need to be ≥0.7) … Again, a terrible model in terms of predictive capability.
Side question: what are the R-squareds? McFadden is easy, because it’s pretty much the same thing as R-squared in linear models: the percentage of deviance reduction by a model (sometimes this is called R2L by Cohen, e.g. in Wikipedia):
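A sketch of that computation from m_01’s deviances:
1 - m_01$deviance/m_01$null.deviance # 1 - 3994.079/4004.273 = 0.002546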
Note that McFadden’s R2 is often surprisingly small, one needs really confident and correct predictions for this coefficient to become what seems ‘nicely high’. Another R2, which seems to be much more widely used (Nagelkerke’s R2), usually paints a much rosier picture (see Wikipedia).
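For completeness, a sketch of Nagelkerke’s R2, which rescales Cox & Snell’s R2 so that its maximum is 1 (R2_CS is just our helper name):
R2_CS <- 1 - exp( # Cox & Snell's R2: the
(m_01$deviance-m_01$null.deviance) / nrow(d)) # deviance reduction per case
R2_CS / (1 - exp(-m_01$null.deviance/nrow(d))) # Nagelkerke's R2: 0.004213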
2.2.2 Visual interpretation
Let’s visualize the nature of the effect of MODALITY based on the predictions:
mo_d <- data.frame( # make mo_d a data frame of
mo <- predictorEffect( # the effect
"MODALITY", # of MODALITY
m_01)) # in m_01
mo_d$fit_lo <- qlogis(mo_d$fit)
mo_d
MODALITY fit se lower upper fit_lo
1 spoken 0.2688427 0.010800758 0.2482073 0.2905308 -1.000502
2 written 0.2229765 0.009511784 0.2048903 0.2421729 -1.248404
plot(mo, # plot the effect of MODALITY
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
2.2.3 Diagnostics 2
Binary logistic regression models also have assumptions that need to be met, but I will only discuss another rule of thumb regarding the amount of data one needs here. One rule of thumb (!) is that the number of data points featuring the less frequent level of the response variable in the model must be greater than the product of the number of values estimated in the model and 15 or, better even, 20.
How many data points do we have?
How many values are estimated in the model? The best way to get that number is this:
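table(d$GENITIVE) # the less frequent level, s, has 880 data points
length(coef(m_01)) # one way (a sketch): count the estimated values: 2
# 880 > 2*20, so the rule of thumb is satisfied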
So we may be fine …
Note: one should again minimally also check that one has enough data points for all (combinations of) levels of predictors and their interactions!
2.2.4 Write-up
To determine whether the choice of a genitive varies as a function of the modality (spoken vs. written), a generalized linear model was fit with GENITIVE as the response variable and MODALITY as a predictor. The model was highly significant (LR=10.19, df=1, p<0.002) but explains very little of the response variable (Nagelkerke R2=0.004, McFadden’s R2=0.003, C=0.53) and has no predictive power (Cohen’s κ=0): the model only ever predicts of-genitives. The coefficients indicate that s-genitives are more likely in spoken data than in written data, as shown in the summary table here. [Show a plot]
Estimate | 95%-CI | se | z | p2-tailed | |
---|---|---|---|---|---|
Intercept (MODALITY=spoken) | -1 | [-1.11, -0.89] | 0.05 | -18.21 | <0.001 |
MODALITYspoken→written | -0.25 | [-0.4, -0.1] | 0.08 | -3.19 | <0.002 |
Some housekeeping:
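str(d <- d[,1:9]) # remove the columns we added above again (a sketch)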
'data.frame': 3600 obs. of 9 variables:
$ CASE : int 2 3 4 5 6 7 8 9 10 11 ...
$ GENITIVE : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
$ SPEAKER : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
$ MODALITY : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
$ POR_LENGTH : int 13 22 11 26 8 7 6 5 5 5 ...
$ PUM_LENGTH : int 7 7 8 4 4 3 36 3 5 5 ...
$ POR_ANIMACY : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
$ POR_FINAL_SIB: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...
$ POR_DEF : Factor w/ 2 levels "definite","indefinite": 1 1 1 1 1 1 1 1 1 1 ...
2.3 A categorical predictor
We now switch to a categorical predictor and ask, does the choice of a genitive construction vary as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal)? How would we explore/describe this? # replace with your own code/answers
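table(d$POR_ANIMACY, d$GENITIVE) # e.g., cross-tabulate animacy & genitive (a sketch)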
of s
animate 370 550
collective 408 199
inanimate 1638 33
locative 199 44
temporal 105 54
2.3.1 Modeling & numerical interpretation
How would we have dealt with this without a model? Again with a chi-squared test – but now we model this right away:
summary(m_01 <- glm( # make/summarize the gen. linear model m_01:
GENITIVE ~ 1 + # GENITIVE ~ an overall intercept (1)
POR_ANIMACY, # & the predictor POR_ANIMACY
family=binomial, # resp = binary
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.39642 0.06724 5.896 3.73e-09 ***
POR_ANIMACYcollective -1.11438 0.10953 -10.174 < 2e-16 ***
POR_ANIMACYinanimate -4.30114 0.18823 -22.851 < 2e-16 ***
POR_ANIMACYlocative -1.90553 0.17965 -10.607 < 2e-16 ***
POR_ANIMACYtemporal -1.06139 0.18045 -5.882 4.06e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 2766.0 on 3595 degrees of freedom
AIC: 2776
Number of Fisher Scoring iterations: 6
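As before, here are the estimates and their profile-likelihood confidence intervals (a sketch using car’s Confint):
Confint(m_01) # estimates & 95%-CIs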
Estimate 2.5 % 97.5 %
(Intercept) 0.3964153 0.2651511 0.5288144
POR_ANIMACYcollective -1.1143776 -1.3303631 -0.9008787
POR_ANIMACYinanimate -4.3011390 -4.6879154 -3.9477968
POR_ANIMACYlocative -1.9055305 -2.2682451 -1.5625579
POR_ANIMACYtemporal -1.0613916 -1.4207695 -0.7121414
Is the model significant? Of course: the only predictor is. But even if there were more than one predictor, we could compute the significance from the deviance reduction like this:
pchisq( # compute the area under the chi-squared curve
q=4004.3-2766, # for this chi-squared value: null - residual deviance
df=3599-3595, # at this df: null - residual df
lower.tail=FALSE) # only using the right/upper tail/side
[1] 7.926259e-267
We can also use drop1 again, but remember the test argument:
drop1(m_01, # drop each droppable predictor at a time from m_01 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 2765.995 2775.995 NA NA
POR_ANIMACY 4 4004.273 4006.273 1238.278 8.015654e-267
The coefficients have the same function as always:
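coef(m_01) # retrieve the coefficients (a sketch)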
(Intercept) POR_ANIMACYcollective POR_ANIMACYinanimate
0.3964153 -1.1143776 -4.3011390
POR_ANIMACYlocative POR_ANIMACYtemporal
-1.9055305 -1.0613916
Meaning,
- the intercept is the prediction when the predictor is its first level (animate);
- coefficient 2 is how you get from the intercept (animate) to the prediction when the predictor is its 2nd level (collective), and that difference is significant;
- coefficient 3 is how you get from the intercept (animate) to the prediction for when the predictor is its 3rd level (inanimate), and that difference is significant;
- coefficient 4 is how you get from the intercept (animate) to the prediction for when the predictor is its 4th level (locative), and that difference is significant;
- coefficient 5 is how you get from the intercept (animate) to the prediction for when the predictor is its 5th level (temporal), and that difference is significant:
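We can verify this manually for one case per animacy level; a sketch (onesies is just our helper name):
onesies <- c(2, 1, 17, 55, 65) # one case per animacy level
d[onesies, c("GENITIVE", "POR_ANIMACY")] # the cases
model.matrix(m_01)[onesies,] # their rows of the model matrix
drop(model.matrix(m_01)[onesies,] %*% coef(m_01)) # their predicted log odds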
GENITIVE POR_ANIMACY
2 of animate
1 of collective
17 of inanimate
55 of locative
65 s temporal
(Intercept) POR_ANIMACYcollective POR_ANIMACYinanimate POR_ANIMACYlocative
2 1 0 0 0
1 1 1 0 0
17 1 0 1 0
55 1 0 0 1
65 1 0 0 0
POR_ANIMACYtemporal
2 0
1 0
17 0
55 0
65 1
2 1 17 55 65
0.3964153 -0.7179623 -3.9047237 -1.5091152 -0.6649763
But it is of course much easier to do that with emmeans again (which gives us predictions on the log odds scale) …
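For comparison, the model’s coefficients and their standard errors (a sketch of one way to extract them):
summary(m_01)$coefficients[,1:2] # estimates & SEs only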
Estimate Std. Error
(Intercept) 0.3964153 0.06723752
POR_ANIMACYcollective -1.1143776 0.10953077
POR_ANIMACYinanimate -4.3011390 0.18822934
POR_ANIMACYlocative -1.9055305 0.17964766
POR_ANIMACYtemporal -1.0613916 0.18045280
data.frame(qwe <- emmeans( # compute estimated marginal means
object=m_01, # for the model m_01
specs= ~ POR_ANIMACY)) # namely all levels of the predictor
POR_ANIMACY emmean SE df asymp.LCL asymp.UCL
1 animate 0.3964153 0.06723752 Inf 0.2646321 0.5281984
2 collective -0.7179623 0.08646448 Inf -0.8874296 -0.5484951
3 inanimate -3.9047237 0.17581069 Inf -4.2493063 -3.5601411
4 locative -1.5091152 0.16659051 Inf -1.8356266 -1.1826038
5 temporal -0.6649763 0.16745844 Inf -0.9931888 -0.3367638
… or with effects (which gives us predictions on the probability scale):
an_d <- data.frame( # make an_d a data frame of
an <- predictorEffect( # the effect
"POR_ANIMACY", # of POR_ANIMACY
m_01)) # in m_01
an_d$fit_lo <- qlogis(an_d$fit)
an_d
POR_ANIMACY fit se lower upper fit_lo
1 animate 0.59782609 0.016165922 0.56577463 0.62906282 0.3964153
2 collective 0.32784185 0.019053448 0.29164055 0.36621363 -0.7179623
3 inanimate 0.01974865 0.003403457 0.01407325 0.02764863 -3.9047237
4 locative 0.18106996 0.024702646 0.13756935 0.23458435 -1.5091152
5 temporal 0.33962264 0.037557428 0.27028269 0.41659580 -0.6649763
We can therefore add the predictions to the data frame as usual:
d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting
m_01, # from m_01
type="response") # predicted probabilities of "s"
d$PRED_LO_s <- predict( # make d$PRED_LO_s the result of predicting
m_01) # log odds of "s" from m_01 (no type="response")
d$PRED_CAT <- factor( # make d$PRED_CAT the result of checking
ifelse(d$PRED_PP_s>=0.5, # if the predicted prob. of "s" is >=0.5
levels(d$GENITIVE)[2], # if yes, predict "s"
levels(d$GENITIVE)[1]), # otherwise, predict "of"
levels=levels(d$GENITIVE)) # make sure both levels are registered
d[2:1,]
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
2 3 of nns spoken 22 7 animate
1 2 of nns spoken 13 7 collective
POR_FINAL_SIB POR_DEF PRED_PP_s PRED_LO_s PRED_CAT
2 absent definite 0.5978261 0.3964153 s
1 absent definite 0.3278418 -0.7179623 of
(I am skipping PRED_PP_obs now, but you shouldn’t!)
But let’s now evaluate the predictive capacity of the model. How well is the model doing?
(c_m <- table( # confusion matrix: cross-tabulate
"OBS" =d$GENITIVE, # observed genitives in the rows
"PREDS"=d$PRED_CAT)) # predicted orders in the columns
PREDS
OBS of s
of 2350 370
s 330 550
c( # evaluate the confusion matrix
"Prec. for s" =c_m[ "s", "s"] / sum(c_m[ , "s"]),
"Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s", ]),
"Prec. for of" =c_m["of","of"] / sum(c_m[ ,"of"]),
"Acc./rec. for of"=c_m["of","of"] / sum(c_m["of", ]),
"Acc. (overall)" =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.5978261 0.6250000 0.8768657 0.8639706
Acc. (overall)
0.8055556
McFadden R-squared Nagelkerke R-squared Cohen's kappa
0.3092390 0.4336234 0.4815668
C
0.8472389
This model is obviously doing quite well!
We have a predictor with multiple levels now – what might we want to do? Post-hoc comparisons:
pairs( # compute pairwise comparisons
emmeans( # of means computed
object=m_01, # for the model m_01
specs= ~ POR_ANIMACY), # namely all contrasts of this predictor
adjust="holm") # now we adjust because we're fishing
contrast estimate SE df z.ratio p.value
animate - collective 1.114 0.110 Inf 10.174 <.0001
animate - inanimate 4.301 0.188 Inf 22.851 <.0001
animate - locative 1.906 0.180 Inf 10.607 <.0001
animate - temporal 1.061 0.180 Inf 5.882 <.0001
collective - inanimate 3.187 0.196 Inf 16.265 <.0001
collective - locative 0.791 0.188 Inf 4.215 0.0001
collective - temporal -0.053 0.188 Inf -0.281 0.7786
inanimate - locative -2.396 0.242 Inf -9.891 <.0001
inanimate - temporal -3.240 0.243 Inf -13.343 <.0001
locative - temporal -0.844 0.236 Inf -3.574 0.0007
Results are given on the log odds ratio (not the response) scale.
P value adjustment: holm method for 10 tests
Every prediction is significantly different from every other one, with just one exception.
2.3.2 Visual interpretation
Let’s visualize the nature of the effect of POR_ANIMACY based on the predictions; for something we’ll need in a moment, we also add log odds predictions:
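For instance like this (a sketch, re-using the effect object an from above):
plot(an, # plot the effect of POR_ANIMACY
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid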
2.3.3 Excursus: conflation via model comparison
Given a lot of previous literature on the genitive alternation, you might actually have hypothesized that the only distinction one needs is one between animate (and maybe collective) and all inanimate possessors. Let’s use model comparison to check this. First, we create a version of POR_ANIMACY that conflates levels as desired:
d$POR_ANIMACY_confl <- d$POR_ANIMACY # create a copy of POR_ANIMACY
levels(d$POR_ANIMACY_confl) <- # overwrite the copy's levels such that
c("anim/coll", "anim/coll", # "animate" and "collective" become "anim/coll"
"inanim/loc/temp", "inanim/loc/temp", "inanim/loc/temp") # all others become "inanim/loc/temp"
table(POR_ANIMACY=d$POR_ANIMACY, POR_ANIMACY_confl=d$POR_ANIMACY_confl) # check you did it right
POR_ANIMACY_confl
POR_ANIMACY anim/coll inanim/loc/temp
animate 920 0
collective 607 0
inanimate 0 1671
locative 0 243
temporal 0 159
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
2 3 of nns spoken 22 7 animate
1 2 of nns spoken 13 7 collective
17 20 of nns spoken 17 12 inanimate
55 59 of nns spoken 7 11 locative
65 72 s nns spoken 5 8 temporal
POR_FINAL_SIB POR_DEF PRED_PP_s PRED_LO_s PRED_CAT POR_ANIMACY_confl
2 absent definite 0.59782609 0.3964153 s anim/coll
1 absent definite 0.32784185 -0.7179623 of anim/coll
17 absent indefinite 0.01974865 -3.9047237 of inanim/loc/temp
55 absent definite 0.18106996 -1.5091152 of inanim/loc/temp
65 absent definite 0.33962264 -0.6649763 of inanim/loc/temp
We already have the model with POR_ANIMACY as the only predictor so we only need to fit a new model with POR_ANIMACY_confl as the only predictor:
summary(m_01_animconfl <- glm( # make/summarize the gen. linear model m_01_animconfl:
GENITIVE ~ 1 + POR_ANIMACY_confl, # GENITIVE ~ an overall intercept (1) & POR_ANIMACY_confl
family=binomial, data=d, na.action=na.exclude)) # the other arguments
Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY_confl, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.03799 0.05119 -0.742 0.458
POR_ANIMACY_conflinanim/loc/temp -2.65829 0.10377 -25.616 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3093.4 on 3598 degrees of freedom
AIC: 3097.4
Number of Fisher Scoring iterations: 5
… and then we do a comparison with anova just like above, the only difference being the argument to test (because we’re now fitting a generalized linear model).
anova(m_01, # compare the model w/ the 5-level predictor POR_ANIMACY
m_01_animconfl, # & the model w/ the 2-level predictor POR_ANIMACY_confl
test="Chisq") %>% data.frame # using a LRT
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 3595 2765.995 NA NA NA
2 3598 3093.390 -3 -327.3942 1.169659e-70
The difference between the two models is hugely significant: we cannot reduce POR_ANIMACY to just the two levels we were considering.
2.3.4 Write-up
To determine whether the choice of a genitive varies as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal), a generalized linear model was fit with GENITIVE as the response variable and POR_ANIMACY as a predictor. The model was highly significant (LR=1238.3, df=4, p<0.001) and explains a decent amount of the response variable (Nagelkerke R2=0.434, McFadden’s R2=0.309, Cohen’s κ=0.482, C=0.85) with a good amount of predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here [I changed the order of the coefficients for maybe better interpretability].
Estimate | 95%-CI | se | z | p2-tailed | |
---|---|---|---|---|---|
Intercept (POR_ANIMACY=animate) | 0.40 | [0.27, 0.53] | 0.07 | 5.9 | <0.001 |
POR_ANIMACYanimate→temporal | -1.06 | [-1.42, -0.71] | 0.18 | -5.88 | <0.002 |
POR_ANIMACYanimate→collective | -1.11 | [-1.33, -0.9] | 0.11 | -10.17 | <0.002 |
POR_ANIMACYanimate→locative | -1.91 | [-2.27, -1.56] | 0.18 | -10.61 | <0.002 |
POR_ANIMACYanimate→inanimate | -4.30 | [-4.69, -3.95] | 0.19 | -22.85 | <0.002 |
The model predicts s-genitives for animate possessors but of-genitives for the rest in the order of temporal/collective possessors, then locative possessors, and then very strongly for inanimate possessors. [Show a plot] A post-hoc comparison of the above model with one in which POR_ANIMACY was reduced to 2 levels (animate/collective vs. inanimate/locative/temporal) showed that this reduced predictor is highly significantly worse (LR=327.39, df=3, p<0.0001).
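Some housekeeping again:
str(d <- d[,1:9]) # remove the added columns again (a sketch)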
'data.frame': 3600 obs. of 9 variables:
$ CASE : int 2 3 4 5 6 7 8 9 10 11 ...
$ GENITIVE : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
$ SPEAKER : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
$ MODALITY : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
$ POR_LENGTH : int 13 22 11 26 8 7 6 5 5 5 ...
$ PUM_LENGTH : int 7 7 8 4 4 3 36 3 5 5 ...
$ POR_ANIMACY : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
$ POR_FINAL_SIB: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...
$ POR_DEF : Factor w/ 2 levels "definite","indefinite": 1 1 1 1 1 1 1 1 1 1 ...
2.4 A numeric predictor
We now switch to a numeric predictor and ask, does the choice of a genitive construction vary as a function of the length of the possessor? How would we explore/describe this? # replace with your own code/answers
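# one possibility (a sketch):
par(mfrow=c(1,2)) # two plots side by side
hist(d$POR_LENGTH, breaks="FD", main="") # the predictor on its own
spineplot(d$GENITIVE ~ d$POR_LENGTH) # the predictor w/ the response
par(mfrow=c(1,1)) # reset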
Maybe not …
d$POR_LENGTH_LOG <- log2(d$POR_LENGTH)
par(mfrow=c(1,2))
hist(d$POR_LENGTH_LOG, breaks="FD")
spineplot(d$GENITIVE ~ d$POR_LENGTH_LOG)
par(mfrow=c(1,1))
Now we’re talking …
2.4.1 Modeling & numerical interpretation
Let’s fit our model:
summary(m_01 <- glm( # make/summarize the gen. linear model m_01
GENITIVE ~ 1 + # GENITIVE ~ an overall intercept (1)
POR_LENGTH_LOG, # & the predictor POR_LENGTH_LOG
family=binomial, # resp = binary
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_LENGTH_LOG, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.90960 0.17609 10.85 <2e-16 ***
POR_LENGTH_LOG -0.90493 0.05325 -16.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3643.1 on 3598 degrees of freedom
AIC: 3647.1
Number of Fisher Scoring iterations: 4
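Again the estimates and their profile-likelihood confidence intervals (a sketch):
Confint(m_01) # estimates & 95%-CIs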
Estimate 2.5 % 97.5 %
(Intercept) 1.9096029 1.567704 2.2581724
POR_LENGTH_LOG -0.9049291 -1.010777 -0.8019605
pchisq( # compute the area under the chi-squared curve
q=4004.3-3643.1, # for this chi-squared value: null - residual deviance
df=3599-3598, # at this df: null - residual df
lower.tail=FALSE) # only using the right/upper tail/side
[1] 1.542725e-80
drop1(m_01, # drop each droppable predictor at a time from m_01 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 3643.097 3647.097 NA NA
POR_LENGTH_LOG 1 4004.273 4006.273 361.1758 1.561528e-80
What do the coefficients mean? The longer the possessor, the less likely s-genitives become:
- the intercept is the prediction when the predictor is 0;
- coefficient 2 is how you get from the intercept to the prediction when the predictor increases by 1, and that difference is significant.
Let’s check the predictions:
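We pick two cases whose POR_LENGTH_LOG values are 0 and 1 and store their row numbers for re-use below; a sketch:
picky <- c(1133, 873) # two cases w/ POR_LENGTH_LOG=0 & 1
d[picky, c("GENITIVE", "POR_LENGTH_LOG", "POR_LENGTH")] # the cases
model.matrix(m_01)[picky,] # their rows of the model matrix
sum(coef(m_01) * model.matrix(m_01)[1133,]) # the prediction for case 1133
sum(coef(m_01) * model.matrix(m_01)[873,]) # the prediction for case 873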
GENITIVE POR_LENGTH_LOG POR_LENGTH
1133 of 0 1
873 s 1 2
(Intercept) POR_LENGTH_LOG
1133 1 0
873 1 1
[1] 1.909603
[1] 1.004674
As always, we can compute them more easily with effects:
(relevants_l <- c(
0:8, # these help us understand the slope
# and cover the range of the actual values
mean(d$POR_LENGTH_LOG)) %>% sort) # the central tendency
[1] 0.000000 1.000000 2.000000 3.000000 3.534649 4.000000 5.000000 6.000000
[9] 7.000000 8.000000
(le_d <- data.frame( # make le_d a data frame of
le <- effect( # the effect
"POR_LENGTH_LOG", # of POR_LENGTH_LOG
m_01, # in m_01
# but predictions for our relevant values:
xlevels=list(POR_LENGTH_LOG=relevants_l))))
POR_LENGTH_LOG fit se lower upper
1 0.000000 0.870974529 0.019788298 0.826995055 0.905057520
2 1.000000 0.731976519 0.024503401 0.681327195 0.777208089
3 2.000000 0.524915543 0.019097226 0.487415805 0.562136377
4 3.000000 0.308917647 0.009060026 0.291448049 0.326951196
5 3.534649 0.216021562 0.007488942 0.201704834 0.231060347
6 4.000000 0.153149017 0.007572996 0.138887878 0.168587815
7 5.000000 0.068176570 0.006568341 0.056375854 0.082232176
8 6.000000 0.028749293 0.004293408 0.021429263 0.038471473
9 7.000000 0.011833690 0.002404071 0.007940200 0.017602483
10 8.000000 0.004821536 0.001237943 0.002913467 0.007969237
And we add our predictions to the data frame as usual …
d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting
m_01, # from m_01
type="response") # predicted probabilities of "s"
d$PRED_LO_s <- predict( # make d$PRED_LO_s the result of predicting
m_01) # log odds of "s" from m_01 (no type="response")
d$PRED_CAT <- factor( # make d$PRED_CAT the result of checking
ifelse(d$PRED_PP_s>=0.5, # if the predicted prob. of "s" is >=0.5
levels(d$GENITIVE)[2], # if yes, predict "s"
levels(d$GENITIVE)[1]), # otherwise, predict "of"
levels=levels(d$GENITIVE)) # make sure both levels are registered
d[picky,]
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1133 1259 of nns spoken 1 23 inanimate
873 973 s nns spoken 2 12 collective
POR_FINAL_SIB POR_DEF POR_LENGTH_LOG PRED_PP_s PRED_LO_s PRED_CAT
1133 absent indefinite 0 0.8709745 1.909603 s
873 absent definite 1 0.7319765 1.004674 s
… and evaluate them as usual:
(c_m <- table( # confusion matrix: cross-tabulate
"OBS" =d$GENITIVE, # observed genitives in the rows
"PREDS"=d$PRED_CAT)) # predicted orders in the columns
PREDS
OBS of s
of 2614 106
s 809 71
c( # evaluate the confusion matrix
"Prec. for s" =c_m[ "s", "s"] / sum(c_m[ , "s"]),
"Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s", ]),
"Prec. for of" =c_m["of","of"] / sum(c_m[ ,"of"]),
"Acc./rec. for of"=c_m["of","of"] / sum(c_m["of", ]),
"Acc. (overall)" =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.40112994 0.08068182 0.76365761 0.96102941
Acc. (overall)
0.74583333
McFadden R-squared Nagelkerke R-squared Cohen's kappa
0.09019761 0.14222054 0.05715463
C
0.70785741
This model is obviously not doing so well. Let’s see why that is.
2.4.2 Visual interpretation
Here’s a plot:
plot(le, # plot the effect of POR_LENGTH_LOG
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
Looks like a nice line – why does it do so badly? In part because there are very few predicted probabilities that exceed 0.5:
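A quick check (a sketch): only 106+71=177 of the 3600 cases get a predicted probability of s of at least 0.5 (cf. the confusion matrix above):
table(d$PRED_PP_s >= 0.5) # how many predicted probabilities reach 0.5?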
2.4.3 Write-up
To determine whether the choice of a genitive varies as a function of the length of the possessor, a generalized linear model was fit with GENITIVE as the response variable and POR_LENGTH as a predictor; however, the initial model resulted in extremely poor fit, probably in part because of the very long right tail of the predictor, which was therefore logged to the base of 2. A new model with this predictor was highly significant (LR=361.18, df=1, p<0.001) but explains only a small amount of the response variable (Nagelkerke R2=0.142, McFadden’s R2=0.09, Cohen’s κ=0.057, C=0.708) with fairly poor predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.
Estimate | 95%-CI | se | z | p2-tailed | |
---|---|---|---|---|---|
Intercept (POR_LENGTH=0) | 1.91 | [1.57, 2.26] | 0.18 | 10.85 | <0.001 |
POR_LENGTH0→1 | -0.9 | [-1.01, -0.8] | 0.05 | -16.99 | <0.001 |
The model indicates a negative correlation between s-genitives and possessor length: the longer the possessor, the less likely s-genitives become and the more likely of-genitives become. [Show a plot]
2.5 A backwards model selection process
Now, we will deal with this in a multifactorial way. Specifically, we are asking, does the choice of a genitive construction (of vs. s) vary as a function of
- a length-based short-before-long effect; but, this time, if we hypothesize a short-before-long effect, maybe we should not just be looking at the length of the possessor (POR_LENGTH), but at how the length of the possessor compares to the length of the possessum (PUM_LENGTH); since such a comparison variable doesn’t exist yet in our data set, we need to create it first;
- the degree/kind of animacy of the possessor (POR_ANIMACY: animate vs. collective vs. inanimate vs. locative vs. temporal);
- whether the speakers are non-native speakers or native speakers of English (SPEAKER: nns vs. ns);
- any pairwise interaction of these predictors;
- the three-way interaction of these predictors?
Quick reminder:
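(baseline <- max( # make baseline the highest
prop.table( # proportion in the
table(d$GENITIVE)))) # frequency table of the response: 0.7555556
deviance(m_00 <- glm( # make m_00 the null model & retrieve
GENITIVE ~ 1, # its deviance: 4004.273 (a sketch,
family=binomial, data=d)) # mirroring section 2.6 below)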
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
hist(d$POR_LENGTH-d$PUM_LENGTH, breaks="FD", main="")
hist(d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH)-log2(d$PUM_LENGTH), breaks="FD", main="")
table(d$POR_ANIMACY)
animate collective inanimate locative temporal
920 607 1671 243 159
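table(d$SPEAKER) # & the no. of cases per speaker group (a sketch)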
nns ns
2666 934
# the predictor(s) w/ the response
spineplot(d$GENITIVE ~ d$LEN_PORmPUM_LOG)
table(d$POR_ANIMACY, d$GENITIVE)
of s
animate 370 550
collective 408 199
inanimate 1638 33
locative 199 44
temporal 105 54
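table(d$SPEAKER, d$GENITIVE) # & SPEAKER w/ the response (a sketch)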
of s
nns 2024 642
ns 696 238
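ftable(d$POR_ANIMACY, d$SPEAKER, d$GENITIVE) # & all three together (a sketch; the exact call may differ)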
of s
animate nns 272 406
ns 98 144
collective nns 314 138
ns 94 61
inanimate nns 1250 27
ns 388 6
locative nns 117 34
ns 82 10
temporal nns 71 37
ns 34 17
2.5.1 Modeling & numerical evaluation
Let’s fit our initial regression model:
summary(m_01 <- glm( # make/summarize the gen. linear model m_01:
GENITIVE ~ 1 + # GENITIVE ~ an overall intercept (1)
LEN_PORmPUM_LOG*POR_ANIMACY*SPEAKER, # & these predictors & their interaction
family=binomial, # resp = binary
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + LEN_PORmPUM_LOG * POR_ANIMACY *
SPEAKER, family = binomial, data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value
(Intercept) 0.65632 0.09135 7.184
LEN_PORmPUM_LOG -0.70601 0.08154 -8.659
POR_ANIMACYcollective -1.77888 0.16150 -11.015
POR_ANIMACYinanimate -4.32698 0.21686 -19.953
POR_ANIMACYlocative -2.30616 0.27227 -8.470
POR_ANIMACYtemporal -1.32138 0.22536 -5.863
SPEAKERns -0.08278 0.17204 -0.481
LEN_PORmPUM_LOG:POR_ANIMACYcollective -0.46680 0.15054 -3.101
LEN_PORmPUM_LOG:POR_ANIMACYinanimate 0.32099 0.17729 1.811
LEN_PORmPUM_LOG:POR_ANIMACYlocative -0.29249 0.25885 -1.130
LEN_PORmPUM_LOG:POR_ANIMACYtemporal 0.44522 0.18375 2.423
LEN_PORmPUM_LOG:SPEAKERns 0.14245 0.16001 0.890
POR_ANIMACYcollective:SPEAKERns 0.79493 0.29353 2.708
POR_ANIMACYinanimate:SPEAKERns -0.28854 0.51213 -0.563
POR_ANIMACYlocative:SPEAKERns -0.50011 0.49785 -1.005
POR_ANIMACYtemporal:SPEAKERns 0.16161 0.41275 0.392
LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns -0.14283 0.28943 -0.493
LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns -0.65372 0.43153 -1.515
LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns -0.19485 0.50679 -0.384
LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns -0.53061 0.38091 -1.393
Pr(>|z|)
(Intercept) 6.75e-13 ***
LEN_PORmPUM_LOG < 2e-16 ***
POR_ANIMACYcollective < 2e-16 ***
POR_ANIMACYinanimate < 2e-16 ***
POR_ANIMACYlocative < 2e-16 ***
POR_ANIMACYtemporal 4.54e-09 ***
SPEAKERns 0.63039
LEN_PORmPUM_LOG:POR_ANIMACYcollective 0.00193 **
LEN_PORmPUM_LOG:POR_ANIMACYinanimate 0.07022 .
LEN_PORmPUM_LOG:POR_ANIMACYlocative 0.25849
LEN_PORmPUM_LOG:POR_ANIMACYtemporal 0.01539 *
LEN_PORmPUM_LOG:SPEAKERns 0.37331
POR_ANIMACYcollective:SPEAKERns 0.00677 **
POR_ANIMACYinanimate:SPEAKERns 0.57315
POR_ANIMACYlocative:SPEAKERns 0.31511
POR_ANIMACYtemporal:SPEAKERns 0.69538
LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns 0.62166
LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns 0.12980
LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns 0.70062
LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns 0.16362
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 2404.3 on 3580 degrees of freedom
AIC: 2444.3
Number of Fisher Scoring iterations: 7
Now what?
pchisq( # compute the area under the chi-squared curve
q =m_01$null.deviance-m_01$deviance, # for this chi-squared value: null - res. dev.
df=m_01$df.null-m_01$df.residual, # at this df: null - residual df
lower.tail=FALSE) # only using the right/upper tail/side
[1] 0
drop1(m_01, # drop each droppable predictor at a time from m_01 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 2404.267 2444.267 NA NA
LEN_PORmPUM_LOG:POR_ANIMACY:SPEAKER 4 2408.005 2440.005 3.738401 0.4425653
Thus, we delete the predictor with the highest non-significant p-value and fit the 2nd model with that 3-way interaction LEN_PORmPUM_LOG:POR_ANIMACY:SPEAKER deleted:
summary(m_02 <- update( # make m_02 an update of
m_01, .~. # m_01, namely all of it (.~.), but then
- LEN_PORmPUM_LOG:POR_ANIMACY:SPEAKER)) # remove this interaction
Call:
glm(formula = GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
LEN_PORmPUM_LOG:POR_ANIMACY + LEN_PORmPUM_LOG:SPEAKER + POR_ANIMACY:SPEAKER,
family = binomial, data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.63865 0.08956 7.131 9.99e-13 ***
LEN_PORmPUM_LOG -0.66568 0.07506 -8.869 < 2e-16 ***
POR_ANIMACYcollective -1.75874 0.15934 -11.038 < 2e-16 ***
POR_ANIMACYinanimate -4.30116 0.21500 -20.006 < 2e-16 ***
POR_ANIMACYlocative -2.29342 0.26298 -8.721 < 2e-16 ***
POR_ANIMACYtemporal -1.31543 0.22698 -5.795 6.82e-09 ***
SPEAKERns -0.01779 0.16826 -0.106 0.9158
LEN_PORmPUM_LOG:POR_ANIMACYcollective -0.50145 0.12853 -3.901 9.56e-05 ***
LEN_PORmPUM_LOG:POR_ANIMACYinanimate 0.20466 0.15951 1.283 0.1995
LEN_PORmPUM_LOG:POR_ANIMACYlocative -0.34103 0.22263 -1.532 0.1256
LEN_PORmPUM_LOG:POR_ANIMACYtemporal 0.31318 0.16011 1.956 0.0505 .
LEN_PORmPUM_LOG:SPEAKERns -0.02120 0.11639 -0.182 0.8555
POR_ANIMACYcollective:SPEAKERns 0.72665 0.29266 2.483 0.0130 *
POR_ANIMACYinanimate:SPEAKERns -0.30015 0.48555 -0.618 0.5365
POR_ANIMACYlocative:SPEAKERns -0.55095 0.45355 -1.215 0.2245
POR_ANIMACYtemporal:SPEAKERns 0.08259 0.40494 0.204 0.8384
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 2408.0 on 3584 degrees of freedom
AIC: 2440
Number of Fisher Scoring iterations: 7
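The LRT comparison of m_01 and m_02 confirms that the deletion was justified (a sketch of the call):
anova(m_01, # compare the model w/ the 3-way interaction
m_02, # & the model w/o it
test="Chisq") %>% data.frame # using a LRT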
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 3580 2404.267 NA NA NA
2 3584 2408.005 -4 -3.738401 0.4425653
drop1(m_02, # drop each droppable predictor at a time from m_02 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 2408.005 2440.005 NA NA
LEN_PORmPUM_LOG:POR_ANIMACY 4 2439.771 2463.771 31.76615364 2.135595e-06
LEN_PORmPUM_LOG:SPEAKER 1 2408.038 2438.038 0.03325705 8.552962e-01
POR_ANIMACY:SPEAKER 4 2418.886 2442.886 10.88037857 2.794180e-02
Since LEN_PORmPUM_LOG:SPEAKER now has the highest non-significant p-value, we fit the 3rd model with that interaction deleted:
summary(m_03 <- update( # make m_03 an update of
m_02, .~. # m_02, namely all of it (.~.), but then
- LEN_PORmPUM_LOG:SPEAKER)) # remove this interaction
Call:
glm(formula = GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
LEN_PORmPUM_LOG:POR_ANIMACY + POR_ANIMACY:SPEAKER, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.64082 0.08887 7.211 5.57e-13 ***
LEN_PORmPUM_LOG -0.67066 0.06999 -9.582 < 2e-16 ***
POR_ANIMACYcollective -1.76343 0.15747 -11.198 < 2e-16 ***
POR_ANIMACYinanimate -4.30316 0.21475 -20.038 < 2e-16 ***
POR_ANIMACYlocative -2.29898 0.26155 -8.790 < 2e-16 ***
POR_ANIMACYtemporal -1.31845 0.22659 -5.819 5.93e-09 ***
SPEAKERns -0.02626 0.16145 -0.163 0.87078
LEN_PORmPUM_LOG:POR_ANIMACYcollective -0.50226 0.12847 -3.910 9.25e-05 ***
LEN_PORmPUM_LOG:POR_ANIMACYinanimate 0.20638 0.15920 1.296 0.19486
LEN_PORmPUM_LOG:POR_ANIMACYlocative -0.34169 0.22268 -1.534 0.12492
LEN_PORmPUM_LOG:POR_ANIMACYtemporal 0.31237 0.16007 1.951 0.05100 .
POR_ANIMACYcollective:SPEAKERns 0.73847 0.28489 2.592 0.00954 **
POR_ANIMACYinanimate:SPEAKERns -0.29379 0.48413 -0.607 0.54396
POR_ANIMACYlocative:SPEAKERns -0.53295 0.44234 -1.205 0.22827
POR_ANIMACYtemporal:SPEAKERns 0.08969 0.40281 0.223 0.82380
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 2408.0 on 3585 degrees of freedom
AIC: 2438
Number of Fisher Scoring iterations: 7
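Again, the LRT comparison confirms the deletion (a sketch):
anova(m_02, # compare the model w/ LEN_PORmPUM_LOG:SPEAKER
m_03, # & the model w/o it
test="Chisq") %>% data.frame # using a LRT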
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 3584 2408.005 NA NA NA
2 3585 2408.038 -1 -0.03325705 0.8552962
drop1(m_03, # drop each droppable predictor at a time from m_03 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 2408.038 2438.038 NA NA
LEN_PORmPUM_LOG:POR_ANIMACY 4 2439.968 2461.968 31.92969 1.977450e-06
POR_ANIMACY:SPEAKER 4 2419.104 2441.104 11.06578 2.583461e-02
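Here are the estimates and their profile-likelihood confidence intervals for m_03 (a sketch):
Confint(m_03) # estimates & 95%-CIs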
Estimate 2.5 % 97.5 %
(Intercept) 0.64081690 0.468616712 0.81724071
LEN_PORmPUM_LOG -0.67065848 -0.810966407 -0.53635908
POR_ANIMACYcollective -1.76342664 -2.077305396 -1.45947457
POR_ANIMACYinanimate -4.30316052 -4.744994412 -3.90029037
POR_ANIMACYlocative -2.29898489 -2.838931650 -1.80923537
POR_ANIMACYtemporal -1.31844526 -1.772188001 -0.88148760
SPEAKERns -0.02626334 -0.341841030 0.29159142
LEN_PORmPUM_LOG:POR_ANIMACYcollective -0.50225743 -0.759750437 -0.25528062
LEN_PORmPUM_LOG:POR_ANIMACYinanimate 0.20637789 -0.102927878 0.52190369
LEN_PORmPUM_LOG:POR_ANIMACYlocative -0.34169459 -0.798758018 0.07811273
LEN_PORmPUM_LOG:POR_ANIMACYtemporal 0.31237087 -0.008354092 0.62161919
POR_ANIMACYcollective:SPEAKERns 0.73846679 0.180371066 1.29820145
POR_ANIMACYinanimate:SPEAKERns -0.29379122 -1.331890328 0.59609089
POR_ANIMACYlocative:SPEAKERns -0.53294681 -1.434208534 0.31251344
POR_ANIMACYtemporal:SPEAKERns 0.08968869 -0.708937514 0.87541391
Let’s compute the overall significance test for this model:
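anova(m_00, # compare the null model from the quick reminder above
m_03, # & our final model
test="Chisq") %>% data.frame # using a LRT (a sketch)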
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 3599 4004.273 NA NA NA
2 3585 2408.038 14 1596.235 0
pchisq( # compute the area under the chi-squared curve
q=4004.3-2408, # for this chi-squared value
df=14, # at 14 df
lower.tail=FALSE) # only using the right/upper tail/side
[1] 0
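Since drop1 showed that nothing else can be deleted from m_03, it is our final model; we store it under the name the code below uses:
m_final <- m_03 # our final model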
Let’s make ‘predictions’ and evaluate them:
d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting
m_final, # from m_final
type="response") # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor( # make d$PRED_CAT the result of checking
ifelse(d$PRED_PP_s>=0.5, # if the predicted prob. of "s" is >=0.5
levels(d$GENITIVE)[2], # if yes, predict "s"
levels(d$GENITIVE)[1]), # otherwise, predict "of"
levels=levels(d$GENITIVE)) # make sure both levels are registered
(c_m <- table( # confusion matrix: cross-tabulate
"OBS" =d$GENITIVE, # observed genitives in the rows
"PREDS"=d$PRED_CAT)) # predicted orders in the columns
PREDS
OBS of s
of 2454 266
s 310 570
c( # evaluate the confusion matrix
"Prec. for s" =c_m[ "s", "s"] / sum(c_m[ , "s"]),
"Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s", ]),
"Prec. for of" =c_m["of","of"] / sum(c_m[ ,"of"]),
"Acc./rec. for of"=c_m["of","of"] / sum(c_m["of", ]),
"Acc. (overall)" =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.6818182 0.6477273 0.8878437 0.9022059
Acc. (overall)
0.8400000
McFadden R-squared Nagelkerke R-squared Cohen's kappa
0.3986328 0.5335965 0.5593935
C
0.8986000
This looks like a pretty decent model, not bad at all (esp. Cohen’s kappa and the C-score).
2.5.2 Interpretation
2.5.2.1 The interaction LEN_PORmPUM_LOG:POR_ANIMACY
Let’s visualize the nature of the effect of LEN_PORmPUM_LOG:POR_ANIMACY based on the predictions:
(lean_d <- data.frame( # make lean_d a data frame of
lean <- effect( # the effect
"LEN_PORmPUM_LOG:POR_ANIMACY", # of LEN_PORmPUM_LOG:POR_ANIMACY
xlevels=list(LEN_PORmPUM_LOG=-4:4), # w/ these values for LEN_PORmPUM_LOG
m_final))) # in m_final
LEN_PORmPUM_LOG POR_ANIMACY fit se lower upper
1 -4 animate 0.964995167 0.010732807 0.9366654268 0.980911180
2 -3 animate 0.933762692 0.015464168 0.8962219877 0.958354439
3 -2 animate 0.878181385 0.019688827 0.8340490730 0.911818185
4 -1 animate 0.786618460 0.020577239 0.7435273242 0.824181653
5 0 animate 0.653396580 0.017519047 0.6183110604 0.686890163
6 1 animate 0.490837160 0.020513568 0.4507767242 0.531015622
7 2 animate 0.330192632 0.029080480 0.2758733167 0.389455090
8 3 animate 0.201334835 0.031196146 0.1470135317 0.269388791
9 4 animate 0.114190794 0.026335613 0.0718292027 0.176777133
10 -4 collective 0.977108479 0.009125137 0.9504633183 0.989578685
11 -3 collective 0.929623169 0.019952030 0.8790212514 0.960022358
12 -2 collective 0.803450188 0.032664294 0.7315682184 0.859772887
13 -1 collective 0.558501674 0.030970559 0.4972271761 0.618044933
14 0 collective 0.281338372 0.022128739 0.2400611313 0.326662174
15 1 collective 0.108056512 0.017076640 0.0788544432 0.146354725
16 2 collective 0.036135870 0.009475954 0.0215225418 0.060062271
17 3 collective 0.011468910 0.004239852 0.0055436094 0.023577327
18 4 collective 0.003577539 0.001705505 0.0014037321 0.009087051
19 -4 inanimate 0.131443260 0.069510582 0.0438729443 0.332938978
20 -3 inanimate 0.086864232 0.037608215 0.0361996368 0.194153942
21 -2 inanimate 0.056422056 0.018427345 0.0294486514 0.105418163
22 -1 inanimate 0.036225313 0.008239215 0.0231206688 0.056329289
23 0 inanimate 0.023081315 0.004063324 0.0163254197 0.032540495
24 1 inanimate 0.014634072 0.003229110 0.0094845324 0.022515951
25 2 inanimate 0.009249061 0.003021352 0.0048681010 0.017503241
26 3 inanimate 0.005833881 0.002646221 0.0023938177 0.014147422
27 4 inanimate 0.003675066 0.002161627 0.0011583720 0.011596083
28 -4 locative 0.904318818 0.065050659 0.6840949593 0.976331696
29 -3 locative 0.774484239 0.095877338 0.5393933482 0.909678349
30 -2 locative 0.555138390 0.087955205 0.3830574676 0.714940059
31 -1 locative 0.311975694 0.043846419 0.2330327080 0.403589135
32 0 locative 0.141455667 0.026043363 0.0976551339 0.200535737
33 1 locative 0.056486712 0.019914540 0.0279778871 0.110735922
34 2 locative 0.021290829 0.011835875 0.0070950857 0.062112445
35 3 locative 0.007842601 0.006002145 0.0017399124 0.034608204
36 4 locative 0.002864010 0.002793417 0.0004221157 0.019161222
37 -4 temporal 0.683948493 0.127825282 0.4044341600 0.873357610
38 -3 temporal 0.601972295 0.109156405 0.3824363698 0.786943563
39 -2 temporal 0.513847009 0.081662166 0.3577119523 0.667325461
40 -1 temporal 0.424852275 0.053325703 0.3250602122 0.531170088
41 0 temporal 0.340476161 0.038786244 0.2689980630 0.420033935
42 1 temporal 0.265132480 0.045062208 0.1865229243 0.362123903
43 2 temporal 0.201370889 0.055382698 0.1137786495 0.331195140
44 3 temporal 0.149817159 0.060437310 0.0650079736 0.308734224
45 4 temporal 0.109649779 0.059635675 0.0358613321 0.289652708
# alternative w/ predictorEffect:
# predictorEffect( # the effect
# "LEN_PORmPUM_LOG", # of LEN_PORmPUM_LOG:POR_ANIMACY
# m_final, # in m_final
# focal.levels=-4:4) # w/ these values for LEN_PORmPUM_LOG
plot(lean, # plot the effect of LEN_PORmPUM_LOG:POR_ANIMACY
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
This might be better with a one-panel kind of plot:
plot(lean, # plot the effect of LEN_PORmPUM_LOG:POR_ANIMACY
type="response", # but plot predicted probs
ylim=c(0, 1), grid=TRUE, # w/ these y-axis limits & w/ a grid
multiline=TRUE, # put all lines in one panel
confint=list(style="auto"), # keep confidence intervals
lattice=list(key.args=list( # make the
columns=5, cex.title=0))) # legend have 5 columns & no title
Still not ideal, and do you recognize a couple of (related) general conceptual ‘problems’ here?
- we cannot see whether lines/slopes are significantly different from each other because the lines are curved: there’s no easy way to compare the confidence intervals;
- it’s therefore also not really straightforward to, for instance, rank-order the slopes from the plot;
- it’s therefore also not obvious how we see whether the slopes for, say, inanimates and temporals are significantly different from 0.
To that end, we could look at emtrends again – to get the slopes …
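data.frame(qwe <- emtrends( # compute estimated slopes (a sketch)
object=m_final, # for the model m_final
specs= ~ POR_ANIMACY, # namely one slope per level of POR_ANIMACY
var="LEN_PORmPUM_LOG")) # the slopes of LEN_PORmPUM_LOG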
POR_ANIMACY LEN_PORmPUM_LOG.trend SE df asymp.LCL asymp.UCL
1 animate -0.6706585 0.06999367 Inf -0.8078436 -0.53347340
2 collective -1.1729159 0.10772937 Inf -1.3840616 -0.96177022
3 inanimate -0.4642806 0.14299041 Inf -0.7445366 -0.18402454
4 locative -1.0123531 0.21139547 Inf -1.4266806 -0.59802557
5 temporal -0.3582876 0.14395677 Inf -0.6404377 -0.07613753
… and pairs to get all pairwise contrasts:
pairs(qwe, # pairwise comparisons of means
adjust="holm") %>% # correct for multiple tests
data.frame # show as a data frame
contrast estimate SE df z.ratio p.value
1 animate - collective 0.5022574 0.1284707 Inf 3.9095082 7.398738e-04
2 animate - inanimate -0.2063779 0.1592023 Inf -1.2963248 5.845908e-01
3 animate - locative 0.3416946 0.2226817 Inf 1.5344527 4.996733e-01
4 animate - temporal -0.3123709 0.1600708 Inf -1.9514543 2.550151e-01
5 collective - inanimate -0.7086353 0.1790304 Inf -3.9581849 6.796935e-04
6 collective - locative -0.1605628 0.2372629 Inf -0.6767298 9.971549e-01
7 collective - temporal -0.8146283 0.1798031 Inf -4.5306679 5.879751e-05
8 inanimate - locative 0.5480725 0.2552142 Inf 2.1474997 1.905211e-01
9 inanimate - temporal -0.1059930 0.2029034 Inf -0.5223814 9.971549e-01
10 locative - temporal -0.6540655 0.2557569 Inf -2.5573717 7.382652e-02
These (differences of) slopes are all on the log odds scale, where they are much easier to compare – the lines are straight, not curved like the regression curves in probability space – so this plot now shows these straight-line log odds slopes in a more comparable way:
plot(lean, # plot the effect of LEN_PORmPUM_LOG:POR_ANIMACY
# note: no type="response"
ylim=c(-5, 5), grid=TRUE, # w/ these y-axis limits & w/ a grid
multiline=TRUE, # put all lines in one panel
confint=list(style="auto"), # keep confidence intervals
lattice=list(key.args=list( # make the
columns=5, cex.title=0))) # legend have 5 columns & no title
2.5.2.2 The interaction POR_ANIMACY:SPEAKER
Let’s visualize the nature of the effect of POR_ANIMACY:SPEAKER based on the predictions; as always, we do both perspectives:
(ansp_d <- data.frame( # make ansp_d a data frame of
ansp <- effect( # the effect
"POR_ANIMACY:SPEAKER", # of POR_ANIMACY:SPEAKER
m_final))) # in m_final
POR_ANIMACY SPEAKER fit se lower upper
1 animate nns 0.59115315 0.020183766 0.55108531 0.63004555
2 collective nns 0.16819763 0.021177629 0.13066029 0.21386608
3 inanimate nns 0.02082207 0.004116296 0.01411360 0.03062019
4 locative nns 0.11215912 0.029827786 0.06562281 0.18515703
5 temporal nns 0.30513068 0.047091093 0.22126791 0.40427725
6 animate ns 0.58479067 0.033545821 0.51791474 0.64868241
7 collective ns 0.29188394 0.041924818 0.21690178 0.38020223
8 inanimate ns 0.01520582 0.006205397 0.00680714 0.03361612
9 locative ns 0.06735270 0.024460582 0.03256874 0.13413561
10 temporal ns 0.31874217 0.066141896 0.20480038 0.45944946
# note what this does:
# data.frame(predictorEffect("POR_ANIMACY", m_final))
plot(ansp, # plot the effect of POR_ANIMACY:SPEAKER
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
plot(ansp, # plot the effect of POR_ANIMACY:SPEAKER
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
x.var="SPEAKER", # w/ this predictor on the x-axis
grid=TRUE) # & w/ a grid
A multi-line plot might be better:
plot(ansp, # plot the effect of POR_ANIMACY:SPEAKER
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
multiline=TRUE, # all the lines in one panel, but
confint=list(style="auto"), # w/ confidence intervals
grid=TRUE) # & w/ a grid
This certainly calls for emmeans applications – one for each perspective – and …
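First, the animacy means within each speaker group; a sketch (stored as qwe for the pairs call below):
data.frame(qwe <- emmeans( # compute estimated marginal means
object=m_final, # for the model m_final
specs= ~ POR_ANIMACY | SPEAKER)) # animacy levels within speaker groups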
POR_ANIMACY SPEAKER emmean SE df asymp.LCL asymp.UCL
1 animate nns 0.3687345 0.08351059 Inf 0.20505674 0.5324122
2 collective nns -1.5984552 0.15136912 Inf -1.89513323 -1.3017772
3 inanimate nns -3.8506997 0.20189287 Inf -4.24640242 -3.4549969
4 locative nns -2.0688740 0.29953757 Inf -2.65595684 -1.4817911
5 temporal nns -0.8229836 0.22210061 Inf -1.25829285 -0.3876744
6 animate ns 0.3424711 0.13815636 Inf 0.07168965 0.6132526
7 collective ns -0.8862518 0.20284137 Inf -1.28381353 -0.4886900
8 inanimate ns -4.1707542 0.41439468 Inf -4.98295287 -3.3585556
9 locative ns -2.6280841 0.38939857 Inf -3.39129133 -1.8648770
10 temporal ns -0.7595583 0.30459699 Inf -1.35655743 -0.1625592
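… and, second, the speaker means within each animacy level; a sketch (stored as asd for the second pairs call below):
data.frame(asd <- emmeans( # compute estimated marginal means
object=m_final, # for the model m_final
specs= ~ SPEAKER | POR_ANIMACY)) # speaker groups within animacy levels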
SPEAKER POR_ANIMACY emmean SE df asymp.LCL asymp.UCL
1 nns animate 0.3687345 0.08351059 Inf 0.20505674 0.5324122
2 ns animate 0.3424711 0.13815636 Inf 0.07168965 0.6132526
3 nns collective -1.5984552 0.15136912 Inf -1.89513323 -1.3017772
4 ns collective -0.8862518 0.20284137 Inf -1.28381353 -0.4886900
5 nns inanimate -3.8506997 0.20189287 Inf -4.24640242 -3.4549969
6 ns inanimate -4.1707542 0.41439468 Inf -4.98295287 -3.3585556
7 nns locative -2.0688740 0.29953757 Inf -2.65595684 -1.4817911
8 ns locative -2.6280841 0.38939857 Inf -3.39129133 -1.8648770
9 nns temporal -0.8229836 0.22210061 Inf -1.25829285 -0.3876744
10 ns temporal -0.7595583 0.30459699 Inf -1.35655743 -0.1625592
… we compute pairwise comparisons, of which the second is probably the much more relevant one:
pairs(qwe, # pairwise comparisons of means
adjust="holm") %>% # correct for multiple tests
data.frame # show as a data frame
contrast SPEAKER estimate SE df z.ratio
1 animate - collective nns 1.9671897 0.1728775 Inf 11.3790961
2 animate - inanimate nns 4.2194342 0.2184828 Inf 19.3124289
3 animate - locative nns 2.4376085 0.3109610 Inf 7.8389512
4 animate - temporal nns 1.1917181 0.2372819 Inf 5.0223727
5 collective - inanimate nns 2.2522445 0.2523358 Inf 8.9255854
6 collective - locative nns 0.4704188 0.3356119 Inf 1.4016748
7 collective - temporal nns -0.7754715 0.2687774 Inf -2.8851813
8 inanimate - locative nns -1.7818257 0.3612250 Inf -4.9327311
9 inanimate - temporal nns -3.0277160 0.3001490 Inf -10.0873773
10 locative - temporal nns -1.2458903 0.3728960 Inf -3.3411200
11 animate - collective ns 1.2287229 0.2454217 Inf 5.0065784
12 animate - inanimate ns 4.5132254 0.4368182 Inf 10.3320454
13 animate - locative ns 2.9705553 0.4131809 Inf 7.1894793
14 animate - temporal ns 1.1020294 0.3344645 Inf 3.2949070
15 collective - inanimate ns 3.2845025 0.4613757 Inf 7.1189319
16 collective - locative ns 1.7418324 0.4390625 Inf 3.9671628
17 collective - temporal ns -0.1266934 0.3659562 Inf -0.3461984
18 inanimate - locative ns -1.5426701 0.5686424 Inf -2.7129001
19 inanimate - temporal ns -3.4111959 0.5142978 Inf -6.6327245
20 locative - temporal ns -1.8685258 0.4943790 Inf -3.7795415
p.value
1 4.783296e-29
2 4.222404e-82
3 2.725951e-14
4 2.551851e-06
5 3.103438e-18
6 1.610124e-01
7 7.823766e-03
8 3.243511e-06
9 5.026055e-23
10 2.503234e-03
11 3.324366e-06
12 5.047266e-24
13 5.853499e-12
14 2.953628e-03
15 8.701328e-12
16 3.636665e-04
17 7.291936e-01
18 1.333944e-02
19 2.306826e-10
20 6.284697e-04
pairs(asd, # pairwise comparisons of means
adjust="holm") %>% # correct for multiple tests
data.frame # show as a data frame
contrast POR_ANIMACY estimate SE df z.ratio p.value
1 nns - ns animate 0.02626334 0.1614503 Inf 0.1626714 0.870777169
2 nns - ns collective -0.71220345 0.2347303 Inf -3.0341350 0.002412265
3 nns - ns inanimate 0.32005456 0.4564186 Inf 0.7012304 0.483159248
4 nns - ns locative 0.55921015 0.4118245 Inf 1.3578847 0.174500270
5 nns - ns temporal -0.06342535 0.3690377 Inf -0.1718668 0.863542223
Native and non-native speakers only differ with regard to collectives.
2.5.3 Excursus: predictions that are most wrong
To explore which predictions are the worst – wrong predictions made most confidently – how might one do that? We order the rows of the data frame d by
- the genitives: first of, then s;
- the predicted probabilities of s-genitives: from high to low.
Then, the head of the data frame will provide of-genitives first, but with very high predicted probabilities of s-genitives and the tail of the data frame will provide s-genitives, but with very low predicted probabilities of s-genitives:
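ordered <- order(d$GENITIVE, # a sketch: order by GENITIVE (of before s) &
-d$PRED_PP_s) # by descending predicted probability of s
head(d[ordered,]) # of-genitives w/ the highest predicted prob. of s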
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1108 1231 of nns spoken 5 23 animate
1097 1220 of nns spoken 9 40 animate
1043 1157 of nns spoken 4 25 collective
1042 1156 of nns spoken 4 24 collective
1044 1158 of nns spoken 4 24 collective
3502 3923 of ns written 7 26 animate
POR_FINAL_SIB POR_DEF POR_LENGTH_LOG PRED_PP_s PRED_LO_s PRED_CAT
1108 absent indefinite 2.321928 0.8925792 -0.19157730 s
1097 absent definite 3.169925 0.8893458 -0.95895434 s
1043 absent definite 2.000000 0.8785117 0.09974479 s
1042 absent definite 2.000000 0.8709443 0.09974479 s
1044 absent definite 2.000000 0.8709443 0.09974479 s
3502 absent definite 2.807355 0.8680890 -0.63085414 s
LEN_PORmPUM_LOG
1108 -2.201634
1097 -2.152003
1043 -2.643856
1042 -2.584963
1044 -2.584963
3502 -1.893085
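tail(d[ordered,]) # s-genitives w/ the lowest predicted prob. of s (same sketch)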
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
994 1105 s nns spoken 10 5 inanimate
576 649 s nns spoken 19 8 inanimate
2113 2362 s nns written 10 4 inanimate
533 601 s nns spoken 25 9 inanimate
992 1103 s nns spoken 18 4 inanimate
1060 1177 s nns spoken 15 3 inanimate
POR_FINAL_SIB POR_DEF POR_LENGTH_LOG PRED_PP_s PRED_LO_s PRED_CAT
994 absent definite 3.321928 0.015880987 -1.096506 of
576 present definite 4.247928 0.014178730 -1.934470 of
2113 absent definite 3.321928 0.013706427 -1.096506 of
533 present definite 4.643856 0.012784419 -2.292758 of
992 absent definite 4.169925 0.009287107 -1.863883 of
1060 absent definite 3.906891 0.008659769 -1.625856 of
LEN_PORmPUM_LOG
994 1.000000
576 1.247928
2113 1.321928
533 1.473931
992 2.169925
1060 2.321928
2.5.4 Write-up
To determine whether the choice of a genitive construction (of vs. s) varies as a function of
- the length difference between logged possessor length and logged possessum length (LEN_PORmPUM_LOG);
- the degree/kind of animacy of the possessor (POR_ANIMACY: animate vs. collective vs. inanimate vs. locative vs. temporal);
- whether the speakers are non-native speakers or native speakers of English (SPEAKER: nns vs. ns);
- any pairwise interaction of these predictors;
- the three-way interaction of these predictors,
a generalized linear model selection process was undertaken (using backwards stepwise model selection of all predictors and controls based on significance testing). The final model resulting from the elimination of ns predictors contained the effects of LEN_PORmPUM_LOG:POR_ANIMACY and SPEAKER:POR_ANIMACY and was highly significant (LR=1596.2, df=14, p<0.0001) with a good amount of explanatory/predictive power (McFadden’s R2=0.397, Nagelkerke’s R2=0.534, C=0.899, Cohen’s κ=0.559). The summary table of the model is shown here.
Estimate | 95%-CI | se | z | p2-tailed | |
---|---|---|---|---|---|
Intercept (LEN_PORmPUM_LOG=0, POR_ANIMACY=animate, SPEAKER=nns) | 0.64 | [0.47, 0.82] | 0.09 | 7.21 | <0.001 |
LEN_PORmPUM_LOG0→1 | -0.67 | [-0.81, -0.54] | 0.07 | -9.58 | <0.001 |
POR_ANIMACYanimate→collective | -1.76 | [-2.08, -1.46] | 0.16 | -11.20 | <0.001 |
POR_ANIMACYanimate→inanimate | -4.30 | [-4.74, -3.90] | 0.21 | -20.04 | <0.001 |
POR_ANIMACYanimate→locative | -2.30 | [-2.84, -1.81] | 0.26 | -8.79 | <0.001 |
POR_ANIMACYanimate→temporal | -1.32 | [-1.77, -0.88] | 0.23 | -5.82 | <0.001 |
SPEAKERnns→ns | -0.03 | [-0.34, 0.29] | 0.16 | -0.16 | 0.8708 |
POR_ANIMACYanimate→collective:LEN_PORmPUM_LOG0→1 | -0.50 | [-0.76, -0.26] | 0.13 | -3.91 | 0.001 |
POR_ANIMACYanimate→inanimate:LEN_PORmPUM_LOG0→1 | 0.21 | [-0.10, 0.52] | 0.16 | 1.30 | 0.1949 |
POR_ANIMACYanimate→locative:LEN_PORmPUM_LOG0→1 | -0.34 | [-0.80, 0.08] | 0.22 | -1.53 | 0.1249 |
POR_ANIMACYanimate→temporal:LEN_PORmPUM_LOG0→1 | 0.31 | [-0.01, 0.62] | 0.16 | 1.95 | 0.051 |
POR_ANIMACYanimate→collective:SPEAKERnns→ns | 0.74 | [0.18, 1.30] | 0.28 | 2.59 | 0.0095 |
POR_ANIMACYanimate→inanimate:SPEAKERnns→ns | -0.29 | [-1.33, 0.60] | 0.48 | -0.61 | 0.544 |
POR_ANIMACYanimate→locative:SPEAKERnns→ns | -0.53 | [-1.43, 0.31] | 0.44 | -1.20 | 0.2283 |
POR_ANIMACYanimate→temporal:SPEAKERnns→ns | 0.09 | [-0.71, 0.88] | 0.40 | 0.22 | 0.8238 |
The results indicate that
- the interaction LEN_PORmPUM_LOG:POR_ANIMACY has the effect that
  - the length difference effect is strongest for collective possessors, followed by locatives, then animates, then inanimates, then temporals, but all of them are significantly different from 0;
  - the only differences between slopes for different animacy levels that are significant (acc. to post-hoc tests adjusted for multiple testing with Holm’s procedure) are the following:
    - collective vs. animate;
    - collective vs. inanimate;
    - collective vs. temporal;
- the interaction SPEAKER:POR_ANIMACY only has the effect that native speakers are significantly more likely to use s-genitives with collectives than learners are – all other speaker differences within each animacy level are ns. [Show plots]
Finally, the model did particularly badly with length difference values close to 0 or positive and/or inanimate possessors, but also, to a lesser extent, with non-native speakers.
Quiz question: I wrote: “the length difference effect is strongest for collective possessors, followed by locatives, then animates, then inanimates, then temporals, but all of them are significantly different from 0” – how do I know that?!
2.6 Subject realization in Japanese
Imagine you have a binary variable, such as whether native and non-native speakers of Japanese realize subjects in sentences in their conversations; the data we will work with are in _input/subjreal.csv, you can find information about the variables/columns in _input/subjreal.r, and our response variable is called SUBJREAL:
rm(list=ls(all=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="_input/subjreal.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
CASE SPKR SPEAKERID SUBJREAL CONTRAST
Min. : 1 njs:7371 26-JE_NNS: 821 no :5016 no :6393
1st Qu.: 3814 nns:7394 19-JC_NNS: 769 yes :1649 yes : 271
Median : 7634 18-JK_NJS: 760 NA's:8100 NA's:8101
Mean : 7630 7-JE_NJS : 731
3rd Qu.:11443 24-JE_NNS: 710
Max. :15265 1-JC_NJS : 705
(Other) :10269
GIVENNESS
Min. : 0.000
1st Qu.: 8.000
Median :10.000
Mean : 7.875
3rd Qu.:10.000
Max. :10.000
NA's :8990
Now, we will deal with this in a multifactorial way. Specifically, we are asking, does the choice of subject realization (no vs. yes) vary as a function of
- the degree of givenness of the referent of the subject on a scale from 0 (‘completely new’) to 10 (‘talked about just now’);
- whether the subject would be contrastive: no vs. yes;
- whether the speakers are non-native speakers or native speakers of Japanese (SPKR: njs vs. nns);
- any pairwise interaction of these predictors;
- the three-way interaction of these predictors?
We first reduce the data set to the complete cases:
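A minimal sketch of one way to do this, assuming it suffices to require completeness on the three modeling variables:
d <- droplevels( # drop now-empty factor levels from
   d[complete.cases( # those rows of d that are complete
      d[, c("SUBJREAL", "CONTRAST", "GIVENNESS")]), ]) # w.r.t. these columns
summary(d) # whose summary is shown below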
CASE SPKR SPEAKERID SUBJREAL CONTRAST
Min. : 1 njs:3176 18-JK_NJS: 386 no :4193 no :5519
1st Qu.: 3562 nns:2599 16-JE_NJS: 359 yes:1582 yes: 256
Median : 7054 8-JE_NJS : 318
Mean : 7220 25-JE_NJS: 300
3rd Qu.:10806 1-JC_NJS : 299
Max. :15259 19-JC_NNS: 290
(Other) :3823
GIVENNESS
Min. : 0.000
1st Qu.: 8.000
Median :10.000
Mean : 7.875
3rd Qu.:10.000
Max. :10.000
Quick reminder:
(baseline <- max( # make baseline the highest
prop.table( # proportion in the
table(d$SUBJREAL)))) # frequency table of the response
[1] 0.7260606
deviance(m_00 <- glm( # make the gen. linear model m_00 & get its deviance:
SUBJREAL ~ 1, # SUBJREAL ~ an overall intercept (1) & no predictor
family=binomial, # the response is binary
data=d)) # those variables are in d
[1] 6781.442
Some exploration of the relevant variables:
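First, how many data points each speaker contributes; a sketch of a call that would produce the table below:
table(d$SPEAKERID) # data points per speaker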
1-JC_NJS 1-JC_NNS 10-JE_NJS 10-JE_NNS 11-JE_NJS 11-JE_NNS 16-JE_NJS 16-JE_NNS
299 266 247 236 195 250 359 189
18-JK_NJS 18-JK_NNS 19-JC_NJS 19-JC_NNS 2-JK_NJS 2-JK_NNS 24-JE_NJS 24-JE_NNS
386 77 160 290 222 278 234 173
25-JE_NJS 25-JE_NNS 26-JE_NJS 26-JE_NNS 7-JE_NJS 7-JE_NNS 8-JE_NJS 8-JE_NNS
300 220 179 248 277 188 318 184
# the predictor(s) w/ the response
spineplot(d$SUBJREAL ~ d$GIVENNESS)
table(d$CONTRAST, d$SUBJREAL)
no yes
no 4151 1368
yes 42 214
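The same for SPKR and for CONTRAST and SPKR together; sketches of calls that would produce (the equivalent of) the two tables below:
table(d$SPKR, d$SUBJREAL) # SPKR w/ the response
ftable(d$CONTRAST, d$SPKR, # CONTRAST & SPKR in the rows,
   d$SUBJREAL) # the response in the columns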
no yes
njs 2367 809
nns 1826 773
no yes
no njs 2344 716
nns 1807 652
yes njs 23 93
nns 19 121
2.6.1 Modeling & numerical evaluation
Let’s fit our initial regression model:
summary(m_01 <- glm( # make/summarize the gen. linear model m_01:
SUBJREAL ~ 1 + # SUBJREAL ~ an overall intercept (1)
GIVENNESS*CONTRAST*SPKR, # & these predictors & their interaction
family=binomial, # resp = binary
data=d)) # those vars are in d
Call:
glm(formula = SUBJREAL ~ 1 + GIVENNESS * CONTRAST * SPKR, family = binomial,
data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.184249 0.102591 11.543 < 2e-16 ***
GIVENNESS -0.323963 0.012433 -26.057 < 2e-16 ***
CONTRASTyes 0.114811 0.362928 0.316 0.752
SPKRnns 0.109794 0.152180 0.721 0.471
GIVENNESS:CONTRASTyes 0.342195 0.050759 6.742 1.57e-11 ***
GIVENNESS:SPKRnns 0.007453 0.018296 0.407 0.684
CONTRASTyes:SPKRnns 0.587306 0.541679 1.084 0.278
GIVENNESS:CONTRASTyes:SPKRnns -0.053055 0.075272 -0.705 0.481
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6781.4 on 5774 degrees of freedom
Residual deviance: 4851.2 on 5767 degrees of freedom
AIC: 4867.2
Number of Fisher Scoring iterations: 4
It looks like we need to simplify things:
drop1(m_01, # drop each droppable predictor at a time from m_01 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 4851.219 4867.219 NA NA
GIVENNESS:CONTRAST:SPKR 1 4851.717 4865.717 0.4980584 0.4803545
Thus, we delete the predictor with the highest non-significant p-value and fit the 2nd model with that 3-way interaction GIVENNESS:CONTRAST:SPKR deleted:
summary(m_02 <- update( # make m_02 an update of
m_01, .~. # m_01, namely all of it (.~.), but then
- GIVENNESS:CONTRAST:SPKR)) # remove this interaction
Call:
glm(formula = SUBJREAL ~ GIVENNESS + CONTRAST + SPKR + GIVENNESS:CONTRAST +
GIVENNESS:SPKR + CONTRAST:SPKR, family = binomial, data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.173926 0.101287 11.590 <2e-16 ***
GIVENNESS -0.322515 0.012234 -26.363 <2e-16 ***
CONTRASTyes 0.248803 0.319542 0.779 0.436
SPKRnns 0.132500 0.148841 0.890 0.373
GIVENNESS:CONTRASTyes 0.317882 0.037457 8.487 <2e-16 ***
GIVENNESS:SPKRnns 0.004307 0.017754 0.243 0.808
CONTRASTyes:SPKRnns 0.297779 0.349738 0.851 0.395
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6781.4 on 5774 degrees of freedom
Residual deviance: 4851.7 on 5768 degrees of freedom
AIC: 4865.7
Number of Fisher Scoring iterations: 4
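The two models can then be compared with a likelihood-ratio test; a sketch of the call that would produce the output below (the analogous comparison follows each later simplification step, too):
anova(m_01, m_02, # compare m_01 & m_02 w/ a
   test="Chisq") %>% data.frame # likelihood-ratio test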
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 5767 4851.219 NA NA NA
2 5768 4851.717 -1 -0.4980584 0.4803545
drop1(m_02, # drop each droppable predictor at a time from m_02 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 4851.717 4865.717 NA NA
GIVENNESS:CONTRAST 1 4910.457 4922.457 58.73974047 1.799695e-14
GIVENNESS:SPKR 1 4851.776 4863.776 0.05881266 8.083825e-01
CONTRAST:SPKR 1 4852.445 4864.445 0.72825511 3.934494e-01
Let’s fit the 3rd model with that interaction GIVENNESS:SPKR deleted:
summary(m_03 <- update( # make m_03 an update of
m_02, .~. # m_02, namely all of it (.~.), but then
- GIVENNESS:SPKR)) # remove this interaction
Call:
glm(formula = SUBJREAL ~ GIVENNESS + CONTRAST + SPKR + GIVENNESS:CONTRAST +
CONTRAST:SPKR, family = binomial, data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.159897 0.082884 13.994 <2e-16 ***
GIVENNESS -0.320545 0.009124 -35.133 <2e-16 ***
CONTRASTyes 0.251807 0.318626 0.790 0.4294
SPKRnns 0.163595 0.075717 2.161 0.0307 *
GIVENNESS:CONTRASTyes 0.317899 0.037449 8.489 <2e-16 ***
CONTRASTyes:SPKRnns 0.289544 0.347981 0.832 0.4054
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6781.4 on 5774 degrees of freedom
Residual deviance: 4851.8 on 5769 degrees of freedom
AIC: 4863.8
Number of Fisher Scoring iterations: 4
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 5768 4851.717 NA NA NA
2 5769 4851.776 -1 -0.05881266 0.8083825
drop1(m_03, # drop each droppable predictor at a time from m_03 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 4851.776 4863.776 NA NA
GIVENNESS:CONTRAST 1 4910.533 4920.533 58.7577749 1.783275e-14
CONTRAST:SPKR 1 4852.471 4862.471 0.6953493 4.043508e-01
Let’s fit the 4th model with that interaction CONTRAST:SPKR deleted:
summary(m_04 <- update( # make m_04 an update of
m_03, .~. # m_03, namely all of it (.~.), but then
- CONTRAST:SPKR)) # remove this interaction
Call:
glm(formula = SUBJREAL ~ GIVENNESS + CONTRAST + SPKR + GIVENNESS:CONTRAST,
family = binomial, data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.153805 0.082515 13.983 <2e-16 ***
GIVENNESS -0.320586 0.009126 -35.131 <2e-16 ***
CONTRASTyes 0.401186 0.267828 1.498 0.1342
SPKRnns 0.177381 0.073887 2.401 0.0164 *
GIVENNESS:CONTRASTyes 0.316619 0.037359 8.475 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6781.4 on 5774 degrees of freedom
Residual deviance: 4852.5 on 5770 degrees of freedom
AIC: 4862.5
Number of Fisher Scoring iterations: 4
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 5769 4851.776 NA NA NA
2 5770 4852.471 -1 -0.6953493 0.4043508
drop1(m_04, # drop each droppable predictor at a time from m_04 &
test="Chisq") %>% data.frame # test its significance w/ a LRT
Df Deviance AIC LRT Pr..Chi.
<none> NA 4852.471 4862.471 NA NA
SPKR 1 4858.229 4866.229 5.757902 1.641466e-02
GIVENNESS:CONTRAST 1 4910.914 4918.914 58.442790 2.092892e-14
We’re done:
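Both remaining terms are significant, so m_04 is our final model. The prediction code further down refers to it as m_final, so presumably it was stored under that name; a sketch of that step and of how the coefficient table below could be generated:
m_final <- m_04 # store the final model under its final name
cbind( # show, for each coefficient,
   Estimate=coef(m_final), # its estimate &
   confint(m_final)) # its profile-likelihood 95%-CI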
Estimate 2.5 % 97.5 %
(Intercept) 1.1538051 0.99373618 1.3173148
GIVENNESS -0.3205855 -0.33865477 -0.3028731
CONTRASTyes 0.4011863 -0.10095012 0.9545382
SPKRnns 0.1773806 0.03251689 0.3222221
GIVENNESS:CONTRASTyes 0.3166187 0.24257305 0.3897207
Let’s compute the overall significance test for this model:
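A sketch of a call that would produce the model comparison shown below:
anova(m_00, m_final, # compare the null model & the final model
   test="Chisq") %>% data.frame # w/ a likelihood-ratio test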
Resid..Df Resid..Dev Df Deviance Pr..Chi.
1 5774 6781.442 NA NA NA
2 5770 4852.471 4 1928.971 0
pchisq( # compute the area under the chi-squared curve
q=6781.4-4852.5, # for this chi-squared value
df=4, # at 4 df
lower.tail=FALSE) # only using the right/upper tail/side
[1] 0
Let’s make ‘predictions’ and evaluate them:
d$PRED_PP_y <- predict( # make d$PRED_PP_y the result of predicting
m_final, # from m_final
type="response") # predicted probabilities of "yes"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor( # make d$PRED_CAT the result of checking
ifelse(d$PRED_PP_y>=0.5, # if the predicted prob. of "yes" is >=0.5
levels(d$SUBJREAL)[2], # if yes, predict "yes"
levels(d$SUBJREAL)[1]), # otherwise, predict "no"
levels=levels(d$SUBJREAL)) # make sure both levels are registered
(c_m <- table( # confusion matrix: cross-tabulate
"OBS" =d$SUBJREAL, # observed choices in the rows
"PREDS"=d$PRED_CAT)) # predicted orders in the columns
PREDS
OBS no yes
no 3920 273
yes 629 953
c( # evaluate the confusion matrix
"Prec. for yes" =c_m["yes", "yes"] / sum(c_m[ , "yes"]),
"Acc./rec. for yes"=c_m["yes", "yes"] / sum(c_m["yes", ]),
"Prec. for no" =c_m[ "no","no"] / sum(c_m[ ,"no" ]),
"Acc./rec. for no" =c_m[ "no","no"] / sum(c_m[ "no", ]),
"Acc. (overall)" =mean(d$SUBJREAL==d$PRED_CAT))
Prec. for yes Acc./rec. for yes Prec. for no Acc./rec. for no
0.7773246 0.6024020 0.8617279 0.9348915
Acc. (overall)
0.8438095
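We also compute some overall fit scores; these are presumably computed with a helper function, but a minimal sketch of equivalent computations (assuming m_00, m_final, c_m, & the predictions from above) could look like this:
n <- nrow(d) # the number of data points
p_e <- sum(rowSums(c_m) * colSums(c_m)) / sum(c_m)^2 # agreement expected by chance
c("McFadden R-squared"= # 1 - the ratio of the two deviances:
      1 - deviance(m_final)/deviance(m_00),
  "Nagelkerke R-squared"= # Cox & Snell's R2, rescaled to max out at 1:
      (1-exp((deviance(m_final)-deviance(m_00))/n)) /
      (1-exp(-deviance(m_00)/n)),
  "Cohen's kappa"= # observed accuracy, corrected for chance agreement:
      (mean(d$SUBJREAL==d$PRED_CAT)-p_e) / (1-p_e),
  "C"= # prob. that a random "yes" case gets a higher predicted
       # probability than a random "no" case (ties counted half):
      mean(outer(d$PRED_PP_y[d$SUBJREAL=="yes"],
                 d$PRED_PP_y[d$SUBJREAL=="no"], ">")) +
      0.5*mean(outer(d$PRED_PP_y[d$SUBJREAL=="yes"],
                     d$PRED_PP_y[d$SUBJREAL=="no"], "==")))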
McFadden R-squared Nagelkerke R-squared Cohen's kappa
0.2844485 0.4109678 0.5777748
C
0.8028431
This looks like a pretty decent model (esp. judging from Cohen’s kappa and the C-score).
2.6.2 Interpretation
2.6.2.1 The main effect SPKR
Let’s visualize the nature of the effect of SPKR based on the predictions from effects:
(sp_d <- data.frame( # make sp_d a data frame of
sp <- effect( # the effect
"SPKR", # of SPKR
m_final))) # in m_final
SPKR fit se lower upper
1 njs 0.2239942 0.008865779 0.2070979 0.2418486
2 nns 0.2563248 0.010362191 0.2365475 0.2771554
# alternative w/ predictorEffect:
# predictorEffect( # the effect
# "SPKR", # of SPKR
# m_final) # in m_final
plot(sp, # plot the effect of SPKR
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
2.6.2.2 The interaction GIVENNESS:CONTRAST
Let’s visualize the nature of the effect of GIVENNESS:CONTRAST based on the predictions:
(gico_d <- data.frame( # make gico_d a data frame of
gico <- effect( # the effect
"GIVENNESS:CONTRAST", # of GIVENNESS:CONTRAST
xlevels=list(GIVENNESS=0:10), # w/ these values for GIVENNESS
m_final))) # in m_final
GIVENNESS CONTRAST fit se lower upper
1 0 no 0.7744540 0.013242416 0.7474477 0.7993477
2 1 no 0.7136236 0.013907147 0.6856094 0.7400878
3 2 no 0.6439301 0.013910734 0.6162240 0.6707048
4 3 no 0.5675538 0.013211328 0.5414985 0.5932406
5 4 no 0.4878254 0.011931698 0.4644833 0.5112207
6 5 no 0.4087117 0.010338719 0.3886138 0.4291195
7 6 no 0.3340599 0.008748289 0.3171384 0.3514197
8 7 no 0.2668889 0.007396797 0.2526440 0.2816343
9 8 no 0.2089857 0.006356385 0.1968004 0.2217172
10 9 no 0.1608880 0.005557064 0.1502920 0.1720798
11 10 no 0.1221505 0.004886624 0.1128915 0.1320558
12 0 yes 0.8368289 0.035072783 0.7560955 0.8945656
13 1 yes 0.8362865 0.031599462 0.7646761 0.8892613
14 2 yes 0.8357427 0.028528717 0.7719877 0.8843418
15 3 yes 0.8351974 0.026018512 0.7777201 0.8801035
16 4 yes 0.8346507 0.024259746 0.7814989 0.8769094
17 5 yes 0.8341025 0.023439157 0.7829586 0.8751170
18 6 yes 0.8335529 0.023671390 0.7818743 0.8749460
19 7 yes 0.8330018 0.024943101 0.7782714 0.8763688
20 8 yes 0.8324492 0.027123270 0.7724056 0.8791310
21 9 yes 0.8318952 0.030028119 0.7646364 0.8828778
22 10 yes 0.8313397 0.033481718 0.7553092 0.8872718
# alternative w/ predictorEffect:
# predictorEffect( # the effect
# "GIVENNESS", # of GIVENNESS:CONTRAST
# m_final, # in m_final
# focal.levels=0:10) # w/ these values for GIVENNESS
plot(gico, # plot the effect of GIVENNESS:CONTRAST
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
This might be better as a one-panel plot:
plot(gico, # plot the effect of GIVENNESS:CONTRAST
type="response", # but plot predicted probs
ylim=c(0, 1), grid=TRUE, # w/ these y-axis limits & w/ a grid
multiline=TRUE, # put all lines in one panel
confint=list(style="auto"), # keep confidence intervals
lattice=list(key.args=list( # make the
columns=2, cex.title=0))) # legend have 2 columns & no title
Quiz question: Is the slope of GIVENNESS when CONTRAST is yes significantly different from 0? In other words, when CONTRAST is yes, is there still a significant effect of GIVENNESS?
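To see how one could check: with treatment contrasts, that slope is the sum of the GIVENNESS coefficient and the interaction coefficient (-0.3206 + 0.3166 ≈ -0.004), and emmeans can compute and test it directly. A minimal sketch:
emtrends(m_final, # the slopes of GIVENNESS
   ~ CONTRAST, # within each level of CONTRAST
   var="GIVENNESS") %>% # in the final model
   summary(infer=TRUE) # w/ CIs & a test of each slope against 0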
2.6.3 Write-up
This one’s for you >:)
R version 4.5.1 Patched (2025-06-20 r88332 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] ggplot2_3.5.2 multcomp_1.4-28 TH.data_1.1-3 MASS_7.3-65
[5] survival_3.8-3 mvtnorm_1.3-3 emmeans_1.11.2 effects_4.2-4
[9] car_3.1-3 carData_3.0-5 STGmisc_1.01 Rcpp_1.1.0
[13] magrittr_2.0.3
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.52 htmlwidgets_1.6.4 insight_1.3.1
[5] lattice_0.22-7 crosstalk_1.2.1 vctrs_0.6.5 tools_4.5.1
[9] Rdpack_2.6.4 generics_0.1.4 sandwich_3.1-1 tibble_3.3.0
[13] pkgconfig_2.0.3 Matrix_1.7-3 data.table_1.17.8 RColorBrewer_1.1-3
[17] lifecycle_1.0.4 farver_2.1.2 mitools_2.4 codetools_0.2-20
[21] survey_4.4-2 htmltools_0.5.8.1 lazyeval_0.2.2 yaml_2.3.10
[25] Formula_1.2-5 plotly_4.11.0 tidyr_1.3.1 pillar_1.11.0
[29] nloptr_2.2.1 reformulas_0.4.1 boot_1.3-31 abind_1.4-8
[33] nlme_3.1-168 tidyselect_1.2.1 digest_0.6.37 purrr_1.1.0
[37] dplyr_1.1.4 splines_4.5.1 fastmap_1.2.0 grid_4.5.1
[41] colorspace_2.1-1 cli_3.6.5 dichromat_2.0-0.1 withr_3.0.2
[45] scales_1.4.0 estimability_1.5.1 httr_1.4.7 rmarkdown_2.29
[49] nnet_7.3-20 lme4_1.1-37 zoo_1.8-14 coda_0.19-4.1
[53] evaluate_1.0.4 knitr_1.50 rbibutils_2.3 viridisLite_0.4.2
[57] rlang_1.1.6 xtable_1.8-4 glue_1.8.0 DBI_1.2.3
[61] rstudioapi_0.17.1 minqa_1.2.8 jsonlite_2.0.0 R6_2.6.1