1.1 Introduction: the ‘incompleteness’ of regression summaries
In this session, we will look at how to get more detail out of models and their summaries. The first part, Section 1.2, is concerned with post-hoc comparisons while the second part, Section 1.3, deals with a priori motivated comparisons (which often are not even straightforwardly approachable from an existing model summary). But as an introduction, let’s consider the issue of how much of what might be interesting a regression summary actually does (not) show, for which we briefly consider the following linear regression model on the duration data using just two predictors:
SURPRISAL_s, a square-root transformed surprisal value;
CLASS: a categorical predictor categorizing words in terms of 5 different levels:
rm(list=ls(all.names=TRUE)); pkgs2load <- c("car", "effects", "emmeans", "multcomp")
suppressMessages(sapply(pkgs2load, library, character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)
d$DURATION_l <- log2(d$DURATION); d$SURPRISAL_s <- sqrt(d$SURPRISAL)
summary(d[,c(12:13,10)])
DURATION_l SURPRISAL_s CLASS
Min. : 3.807 Min. :0.000 closed :4021
1st Qu.: 6.781 1st Qu.:2.000 number : 125
Median : 7.340 Median :2.449 open :2086
Mean : 7.348 Mean :2.371 propname: 110
3rd Qu.: 7.937 3rd Qu.:2.828 unclear : 41
Max. :10.456 Max. :4.583
m <- lm(DURATION_l ~ 1 + SURPRISAL_s*CLASS, data=d)
Consider all the questions that one might have with regard to just the interaction of SURPRISAL_s:CLASS. Once you think about it for a moment, I hope you realize that the questions one might have are potentially all of these:
1. The slope of SURPRISAL_s when CLASS is closed,
   a. what is it?
   b. is it significantly different from 0?
   c. is it significantly different from the slope of SURPRISAL_s when CLASS is number?
   d. is it significantly different from the slope of SURPRISAL_s when CLASS is open?
   e. is it significantly different from the slope of SURPRISAL_s when CLASS is propname?
   f. is it significantly different from the slope of SURPRISAL_s when CLASS is unclear?
2. The slope of SURPRISAL_s when CLASS is number,
   a. what is it?
   b. is it significantly different from 0?
   c. is it significantly different from the slope of SURPRISAL_s when CLASS is open?
   d. is it significantly different from the slope of SURPRISAL_s when CLASS is propname?
   e. is it significantly different from the slope of SURPRISAL_s when CLASS is unclear?
3. The slope of SURPRISAL_s when CLASS is open,
   a. what is it?
   b. is it significantly different from 0?
   c. is it significantly different from the slope of SURPRISAL_s when CLASS is propname?
   d. is it significantly different from the slope of SURPRISAL_s when CLASS is unclear?
4. The slope of SURPRISAL_s when CLASS is propname,
   a. what is it?
   b. is it significantly different from 0?
   c. is it significantly different from the slope of SURPRISAL_s when CLASS is unclear?
5. The slope of SURPRISAL_s when CLASS is unclear,
   a. what is it?
   b. is it significantly different from 0?
Again, these are just the most basic questions that would probably be of interest – I am leaving out more specialized ones based on conflations and possibly interesting tests for specific values.
But now consider again what most people provide, which, if we’re lucky, is this:
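summary(m)$coef %>% round(4)
plot(allEffects(m), grid=TRUE)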
You see there’s a big discrepancy. Which of the above questions does this output answer (directly or after one addition)?
when CLASS is closed, this answers all 6 questions (a to f);
when CLASS is number, this answers 1 out of 5 questions (a);
when CLASS is open, this answers 1 out of 4 questions (a);
when CLASS is propname, this answers 1 out of 3 questions (a);
when CLASS is unclear, this answers 1 out of 2 questions (a).
Not a great result: we get 10 answers for minimally 20 questions and for many of the as yet unanswered questions you may not even know yet how to get them – this session and the next will change this!
We begin by re-loading and re-preparing the same duration data; this time we’re centering the surprisal predictor:
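rm(list=ls(all.names=TRUE))
d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)
d$DURATION_l <- log2(d$DURATION)
d$SURPRISAL_sc <- sqrt(d$SURPRISAL) - mean(sqrt(d$SURPRISAL))

We fit a first model:

(m_01 <- lm(data=d, formula=DURATION_l ~ 1 + SURPRISAL_sc + VOWELS + CLASS +
   SURPRISAL_sc:VOWELS + SURPRISAL_sc:CLASS + VOWELS:CLASS)) %>% {
   list("Summary"=summary(.),
        "Drop 1"=drop1(., test="F") %>% as.data.frame) }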
In the interest of covering everything – all 20 questions from above and more – in excruciating detail, we’ll go over nine kinds of scenarios (yes, nine, you can thank me later …) which arise from the combination of two three-possibility scenarios:
Table 1: 9 scenarios relevant for post-hoc exploration of models

| | a: what are the means/slopes? | b: are they * diff from 0? | c: are they * diff from each other? |
|---|---|---|---|
| 1: means for (combos of) levels of cat preds | scenario 1a | scenario 1b | scenario 1c |
| 2: means for values of numeric preds | scenario 2a | scenario 2b | scenario 2c |
| 3: slopes for (combos of) levels of cat preds | scenario 3a | scenario 3b | scenario 3c |
Yes, this seems annoying but each of the three questions in the columns has potentially important implications, and the way they are tackled depends on what one looks at: means, slopes, etc. And, maybe somewhat confusingly once you recognize it, none of these is fully answered by the summary output and hardly any study I see discusses all three.
1.2 Post-hoc comparisons
1.2.1 Scen. 1a: means for cat
Scenario 1a means to ask the following question: when SURPRISAL_sc is controlled for, what is each of the 10 predicted means of DURATION_l when you cross the 2 levels of VOWELS and the 5 levels of CLASS? A simple approach that I often use to answer all scenario 1 questions and, maybe more generally, to make sense of the coefficients of models is useful here as well: we create a data frame nd (for 'new data') that contains all variables in the rhs of the model formula and crosses
* all levels of all categorical variables in the rhs (here VOWELS and CLASS);
* the values of 0 and 1 for all numeric variables in the rhs (and, when the numeric predictors are not centered, I usually also include the mean value for each numeric variable):
nd <- expand.grid(              # generate all combinations of
   SURPRISAL_sc=0:1,            # main values of SURPRISAL_sc
   CLASS=levels(d$CLASS),       # all levels of CLASS
   VOWELS=levels(d$VOWELS)) %>% # all levels of VOWELS, then
   "["(,3:1)                    # reorder columns
To that data frame, we then add the predictions the final model makes for all these combinations and, when easily available, also their confidence intervals; we can then just look at the whole data frame or create different tabular versions of it – here’s one I find useful (because the rows let us see the predictions for SURPRISAL_sc 0 and 1 well, the difference of which is the slope of SURPRISAL_sc):
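nd <- cbind(nd, PREDS=predict(m_01, newdata=nd, interval="confidence"))
colnames(nd)[4:6] <- c("fit", "lwr", "upr")
nd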
VOWELS CLASS SURPRISAL_sc fit lwr upr
1 one closed 0 7.140473 7.114057 7.166890
2 one closed 1 7.581558 7.528438 7.634679
3 one number 0 7.907337 7.739671 8.075002
4 one number 1 8.023885 7.821621 8.226149
5 one open 0 7.643108 7.574309 7.711908
6 one open 1 8.160513 8.085945 8.235081
7 one propname 0 8.443379 8.173478 8.713281
8 one propname 1 8.521196 8.382297 8.660096
9 one unclear 0 7.960958 7.662950 8.258965
10 one unclear 1 7.665738 7.278268 8.053208
11 two+ closed 0 6.983092 6.937840 7.028343
12 two+ closed 1 7.240932 7.157581 7.324282
13 two+ number 0 7.932180 7.655043 8.209317
14 two+ number 1 7.865484 7.567353 8.163614
15 two+ open 0 7.795353 7.755931 7.834775
16 two+ open 1 8.129512 8.088166 8.170859
17 two+ propname 0 8.605884 8.187216 9.024552
18 two+ propname 1 8.500456 8.154602 8.846310
19 two+ unclear 0 8.039396 7.433469 8.645323
20 two+ unclear 1 7.560931 6.961012 8.160851
tapply(FUN=c, X=nd$fit,             # apply the function c to the predictions
   INDEX=list(                      # which are grouped as follows:
      CLASS=nd$CLASS,               # rows
      SURPRISAL_sc=nd$SURPRISAL_sc, # columns
      VOWELS=nd$VOWELS))            # slices
, , VOWELS = one
SURPRISAL_sc
CLASS 0 1
closed 7.140473 7.581558
number 7.907337 8.023885
open 7.643108 8.160513
propname 8.443379 8.521196
unclear 7.960958 7.665738
, , VOWELS = two+
SURPRISAL_sc
CLASS 0 1
closed 6.983092 7.240932
number 7.932180 7.865484
open 7.795353 8.129512
propname 8.605884 8.500456
unclear 8.039396 7.560931
In this case, half of these are what we get from effect, which is one of the most efficient ways to answer 1a questions. This is because effect conveniently holds the predictors not included in the term at ‘typical values’, which for the centered (!) values of SURPRISAL_sc means the effects values are those in nd when SURPRISAL_sc is 0:
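(clvo_d <- data.frame(clvo <- effect("VOWELS:CLASS", m_01)))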
VOWELS CLASS fit se lower upper
1 one closed 7.140473 0.01347541 7.114057 7.166890
2 two+ closed 6.983092 0.02308346 6.937840 7.028343
3 one number 7.907337 0.08552890 7.739671 8.075002
4 two+ number 7.932180 0.14137211 7.655043 8.209317
5 one open 7.643108 0.03509579 7.574309 7.711908
6 two+ open 7.795353 0.02010977 7.755931 7.834775
7 one propname 8.443379 0.13768123 8.173478 8.713281
8 two+ propname 8.605884 0.21356943 8.187216 9.024552
9 one unclear 7.960958 0.15201852 7.662950 8.258965
10 two+ unclear 8.039396 0.30909339 7.433469 8.645323
1.2.2 Scen. 1b: means for cat vs. 0
Scenario 1b means to ask whether each of the 10 means computed above is significantly different from 0 (and what the exact p-value is). This can so far not be answered fully from the summary output of the model. We can see from nd that none of the predicted means’ confidence intervals overlaps with 0, but we don’t have exact p-values. Right now, we know the answer to this question only for 1 of the 10 means, the one represented by the intercept in summary(m_01)$coefficients: CLASS: closed with VOWELS: one, which clearly is significantly different from 0. With your current knowledge, the ‘best’ way to proceed for you might be to generate m_01 9 more times, each time with CLASS and VOWELS re-leveled in such ways that the intercept comes to represent all 10 possible combinations – clearly not a desirable option, which is why we will discuss a much better approach another time.
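For illustration, here is what just one of those 9 re-fits might look like (a sketch; the object names d2 and m_01_relev are hypothetical):

d2 <- d                                     # a copy of the data for one re-leveled re-fit
d2$CLASS  <- relevel(d2$CLASS,  ref="open") # make open the reference level
d2$VOWELS <- relevel(d2$VOWELS, ref="two+") # make two+ the reference level
m_01_relev <- update(m_01, data=d2)         # re-fit the same formula
summary(m_01_relev)$coefficients[1,] # intercept: mean for open & two+, tested against 0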
1.2.3 Scen. 1c: means for cat vs. each other
Scenario 1c means to ask whether (some? all?) differences between the 10 means are significant. For that, we can use emmeans::emmeans and pairs. For a 2-way interaction like this, there are three options – two of them (options a and b) are usually/often pretty good/useful and justifiable, the last one (option c) often not so much.
Option a here is to make comparisons of the predicted means for the levels of CLASS within the levels of VOWELS:
pairs(emmeans(               # compute pairwise comparisons of means
   object=m_01,              # for the model m_01
   specs=~ CLASS | VOWELS),  # of CLASS within each level of VOWELS
   # you could also write this like this:
   # specs="CLASS",          # of CLASS
   # by="VOWELS"),           # within each level of VOWELS
   adjust="none")            # do not adjust for multiple comparisons
VOWELS = one:
contrast estimate SE df t.ratio p.value
closed - number -0.7669 0.0867 6367 -8.848 <0.0001
closed - open -0.5026 0.0382 6367 -13.155 <0.0001
closed - propname -1.3029 0.1380 6367 -9.413 <0.0001
closed - unclear -0.8205 0.1530 6367 -5.375 <0.0001
number - open 0.2642 0.0921 6367 2.869 0.0041
number - propname -0.5360 0.1620 6367 -3.309 0.0009
number - unclear -0.0536 0.1740 6367 -0.307 0.7585
open - propname -0.8003 0.1420 6367 -5.645 <0.0001
open - unclear -0.3178 0.1560 6367 -2.039 0.0415
propname - unclear 0.4824 0.2050 6367 2.353 0.0187
VOWELS = two+:
contrast estimate SE df t.ratio p.value
closed - number -0.9491 0.1440 6367 -6.606 <0.0001
closed - open -0.8123 0.0312 6367 -25.993 <0.0001
closed - propname -1.6228 0.2160 6367 -7.523 <0.0001
closed - unclear -1.0563 0.3100 6367 -3.405 0.0007
number - open 0.1368 0.1430 6367 0.960 0.3373
number - propname -0.6737 0.2550 6367 -2.641 0.0083
number - unclear -0.1072 0.3400 6367 -0.316 0.7522
open - propname -0.8105 0.2140 6367 -3.786 0.0002
open - unclear -0.2440 0.3100 6367 -0.788 0.4306
propname - unclear 0.5665 0.3750 6367 1.512 0.1306
You can see that these results nicely map onto our data frame nd; here are the first two:
round(nd$fit[1] - nd$fit[3], 4)
[1] -0.7669
round(nd$fit[1] - nd$fit[5], 4)
[1] -0.5026
But note that we should probably adjust for multiple comparisons because the high number of post-hoc comparisons we’re making increases the probability of false positives. The Holm method is a useful and not too conservative way to do that:
pairs(emmeans(               # compute pairwise comparisons of means
   object=m_01,              # for the model m_01
   specs=~ CLASS | VOWELS),  # of CLASS within each level of VOWELS
   adjust="holm")            # adjust for multiple comparisons w/ Holm's method
VOWELS = one:
contrast estimate SE df t.ratio p.value
closed - number -0.7669 0.0867 6367 -8.848 <0.0001
closed - open -0.5026 0.0382 6367 -13.155 <0.0001
closed - propname -1.3029 0.1380 6367 -9.413 <0.0001
closed - unclear -0.8205 0.1530 6367 -5.375 <0.0001
number - open 0.2642 0.0921 6367 2.869 0.0165
number - propname -0.5360 0.1620 6367 -3.309 0.0047
number - unclear -0.0536 0.1740 6367 -0.307 0.7585
open - propname -0.8003 0.1420 6367 -5.645 <0.0001
open - unclear -0.3178 0.1560 6367 -2.039 0.0830
propname - unclear 0.4824 0.2050 6367 2.353 0.0560
VOWELS = two+:
contrast estimate SE df t.ratio p.value
closed - number -0.9491 0.1440 6367 -6.606 <0.0001
closed - open -0.8123 0.0312 6367 -25.993 <0.0001
closed - propname -1.6228 0.2160 6367 -7.523 <0.0001
closed - unclear -1.0563 0.3100 6367 -3.405 0.0040
number - open 0.1368 0.1430 6367 0.960 1.0000
number - propname -0.6737 0.2550 6367 -2.641 0.0414
number - unclear -0.1072 0.3400 6367 -0.316 1.0000
open - propname -0.8105 0.2140 6367 -3.786 0.0011
open - unclear -0.2440 0.3100 6367 -0.788 1.0000
propname - unclear 0.5665 0.3750 6367 1.512 0.5222
P value adjustment: holm method for 10 tests
For the most part, I will continue to not adjust for multiple tests, but that’s also only for didactic reasons, namely so you can make comparisons to the models’ summaries – in real applications, adjusting for multiple tests is often useful if not required.
Option b here is the reverse perspective from the last one: to make comparisons of the predicted means for the levels of VOWELS within the levels of CLASS:
pairs(emmeans(               # compute pairwise comparisons of means
   object=m_01,              # for the model m_01
   specs=~ VOWELS | CLASS),  # of VOWELS within each level of CLASS
   # you could also write this like this:
   # specs="VOWELS",         # of VOWELS
   # by="CLASS"),            # within each level of CLASS
   adjust="none")            # do not adjust for multiple comparisons
CLASS = closed:
contrast estimate SE df t.ratio p.value
one - (two+) 0.1574 0.0264 6367 5.951 <0.0001
CLASS = number:
contrast estimate SE df t.ratio p.value
one - (two+) -0.0248 0.1490 6367 -0.167 0.8674
CLASS = open:
contrast estimate SE df t.ratio p.value
one - (two+) -0.1522 0.0387 6367 -3.929 <0.0001
CLASS = propname:
contrast estimate SE df t.ratio p.value
one - (two+) -0.1625 0.1930 6367 -0.844 0.3988
CLASS = unclear:
contrast estimate SE df t.ratio p.value
one - (two+) -0.0784 0.3040 6367 -0.258 0.7963
Again, this maps perfectly onto what we’d see in nd; here are the first two:
round(nd$fit[1] - nd$fit[11], 4)
[1] 0.1574
round(nd$fit[3] - nd$fit[13], 4)
[1] -0.0248
Option c, the not recommended one, is to make all possible comparisons across all combinations of levels – fishing for results by considering even comparisons no one would ever postulate a priori; for that extreme fishing, adjusting for multiple comparisons is pretty much obligatory.
pairs(emmeans(               # compute pairwise comparisons of means
   object=m_01,              # for the model m_01
   specs=~ VOWELS:CLASS),    # of VOWELS:CLASS
   adjust="holm")            # adjust for multiple comparisons w/ Holm's method
contrast estimate SE df t.ratio p.value
one closed - (two+ closed) 0.1574 0.0264 6367 5.951 <0.0001
one closed - one number -0.7669 0.0867 6367 -8.848 <0.0001
one closed - (two+ number) -0.7917 0.1420 6367 -5.581 <0.0001
one closed - one open -0.5026 0.0382 6367 -13.155 <0.0001
one closed - (two+ open) -0.6549 0.0239 6367 -27.381 <0.0001
one closed - one propname -1.3029 0.1380 6367 -9.413 <0.0001
one closed - (two+ propname) -1.4654 0.2140 6367 -6.858 <0.0001
one closed - one unclear -0.8205 0.1530 6367 -5.375 <0.0001
one closed - (two+ unclear) -0.8989 0.3090 6367 -2.906 0.0807
(two+ closed) - one number -0.9242 0.0883 6367 -10.462 <0.0001
(two+ closed) - (two+ number) -0.9491 0.1440 6367 -6.606 <0.0001
(two+ closed) - one open -0.6600 0.0404 6367 -16.330 <0.0001
(two+ closed) - (two+ open) -0.8123 0.0312 6367 -25.993 <0.0001
(two+ closed) - one propname -1.4603 0.1390 6367 -10.476 <0.0001
(two+ closed) - (two+ propname) -1.6228 0.2160 6367 -7.523 <0.0001
(two+ closed) - one unclear -0.9779 0.1540 6367 -6.363 <0.0001
(two+ closed) - (two+ unclear) -1.0563 0.3100 6367 -3.405 0.0167
one number - (two+ number) -0.0248 0.1490 6367 -0.167 1.0000
one number - one open 0.2642 0.0921 6367 2.869 0.0866
one number - (two+ open) 0.1120 0.0880 6367 1.273 1.0000
one number - one propname -0.5360 0.1620 6367 -3.309 0.0226
one number - (two+ propname) -0.6985 0.2300 6367 -3.031 0.0563
one number - one unclear -0.0536 0.1740 6367 -0.307 1.0000
one number - (two+ unclear) -0.1321 0.3210 6367 -0.412 1.0000
(two+ number) - one open 0.2891 0.1460 6367 1.976 0.7231
(two+ number) - (two+ open) 0.1368 0.1430 6367 0.960 1.0000
(two+ number) - one propname -0.5112 0.1980 6367 -2.588 0.1840
(two+ number) - (two+ propname) -0.6737 0.2550 6367 -2.641 0.1655
(two+ number) - one unclear -0.0288 0.2080 6367 -0.139 1.0000
(two+ number) - (two+ unclear) -0.1072 0.3400 6367 -0.316 1.0000
one open - (two+ open) -0.1522 0.0387 6367 -3.929 0.0023
one open - one propname -0.8003 0.1420 6367 -5.645 <0.0001
one open - (two+ propname) -0.9628 0.2180 6367 -4.420 0.0003
one open - one unclear -0.3178 0.1560 6367 -2.039 0.6641
one open - (two+ unclear) -0.3963 0.3120 6367 -1.272 1.0000
(two+ open) - one propname -0.6480 0.1390 6367 -4.654 <0.0001
(two+ open) - (two+ propname) -0.8105 0.2140 6367 -3.786 0.0040
(two+ open) - one unclear -0.1656 0.1530 6367 -1.080 1.0000
(two+ open) - (two+ unclear) -0.2440 0.3100 6367 -0.788 1.0000
one propname - (two+ propname) -0.1625 0.1930 6367 -0.844 1.0000
one propname - one unclear 0.4824 0.2050 6367 2.353 0.3174
one propname - (two+ unclear) 0.4040 0.3390 6367 1.193 1.0000
(two+ propname) - one unclear 0.6449 0.2620 6367 2.458 0.2518
(two+ propname) - (two+ unclear) 0.5665 0.3750 6367 1.512 1.0000
one unclear - (two+ unclear) -0.0784 0.3040 6367 -0.258 1.0000
P value adjustment: holm method for 45 tests
Now, what if you had only been interested in a subset of the comparisons that pairs(emmeans(...)) makes? If you were interested in only, say, 5 comparisons, you don’t want pairs to adjust for 10. In such cases, you can extract the 5 non-adjusted p-values you’re interested in and correct them yourself only for the 5 tests you’re making. Imagine you were only interested in the comparisons 11 to 15 shown in the following output:
(qwe <- pairs(emmeans(       # compute pairwise comparisons of means
   object=m_01,              # for the model m_01
   specs=~ CLASS | VOWELS),  # of CLASS within each level of VOWELS
   adjust="none"))           # do not adjust for multiple comparisons
VOWELS = one:
contrast estimate SE df t.ratio p.value
closed - number -0.7669 0.0867 6367 -8.848 <0.0001
closed - open -0.5026 0.0382 6367 -13.155 <0.0001
closed - propname -1.3029 0.1380 6367 -9.413 <0.0001
closed - unclear -0.8205 0.1530 6367 -5.375 <0.0001
number - open 0.2642 0.0921 6367 2.869 0.0041
number - propname -0.5360 0.1620 6367 -3.309 0.0009
number - unclear -0.0536 0.1740 6367 -0.307 0.7585
open - propname -0.8003 0.1420 6367 -5.645 <0.0001
open - unclear -0.3178 0.1560 6367 -2.039 0.0415
propname - unclear 0.4824 0.2050 6367 2.353 0.0187
VOWELS = two+:
contrast estimate SE df t.ratio p.value
closed - number -0.9491 0.1440 6367 -6.606 <0.0001
closed - open -0.8123 0.0312 6367 -25.993 <0.0001
closed - propname -1.6228 0.2160 6367 -7.523 <0.0001
closed - unclear -1.0563 0.3100 6367 -3.405 0.0007
number - open 0.1368 0.1430 6367 0.960 0.3373
number - propname -0.6737 0.2550 6367 -2.641 0.0083
number - unclear -0.1072 0.3400 6367 -0.316 0.7522
open - propname -0.8105 0.2140 6367 -3.786 0.0002
open - unclear -0.2440 0.3100 6367 -0.788 0.4306
propname - unclear 0.5665 0.3750 6367 1.512 0.1306
Then this is how you would proceed because the function p.adjust determines how many tests to adjust for by looking at the length of the p argument, here 5:
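p.adjust(p=data.frame(qwe)[11:15,7], # the 5 non-adjusted p-values of interest
   method="holm")                    # adjust only those w/ Holm's method

1.2.4 Scen. 2a: means for num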
To deal with the 3 scenarios 2a, 2b, and 2c, we now look at a new model m_02, which is slightly simpler (we leave out VOWELS) and fit slightly differently; importantly, note that SURPRISAL is now square-root transformed, but not centered:
d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)
d$DURATION_l <- log2(d$DURATION)   # we log DURATION again
d$SURPRISAL_s <- sqrt(d$SURPRISAL) # SURPRISAL is now not centered
(m_02 <- lm(data=d, formula=DURATION_l ~ 1 + SURPRISAL_s*CLASS)) %>% {
   list("Summary"=summary(.),
        "Drop 1"=drop1(., test="F") %>% as.data.frame) }
$Summary
Call:
lm(formula = DURATION_l ~ 1 + SURPRISAL_s * CLASS, data = d)
Residuals:
Min 1Q Median 3Q Max
-2.91777 -0.42367 0.00454 0.44474 2.85643
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.16980 0.03707 166.424 < 2e-16 ***
SURPRISAL_s 0.39268 0.01697 23.142 < 2e-16 ***
CLASSnumber 1.58940 0.36083 4.405 1.08e-05 ***
CLASSopen 0.70446 0.07590 9.281 < 2e-16 ***
CLASSpropname 2.16824 0.38200 5.676 1.44e-08 ***
CLASSunclear 2.57069 0.72650 3.538 0.000405 ***
SURPRISAL_s:CLASSnumber -0.32626 0.12879 -2.533 0.011322 *
SURPRISAL_s:CLASSopen -0.01876 0.02847 -0.659 0.509898
SURPRISAL_s:CLASSpropname -0.33912 0.10922 -3.105 0.001911 **
SURPRISAL_s:CLASSunclear -0.71745 0.25946 -2.765 0.005705 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6849 on 6373 degrees of freedom
Multiple R-squared: 0.3762, Adjusted R-squared: 0.3754
F-statistic: 427.1 on 9 and 6373 DF, p-value: < 2.2e-16
$`Drop 1`
Df Sum of Sq RSS AIC F value Pr(>F)
<none> NA NA 2989.676 -4821.334 NA NA
SURPRISAL_s:CLASS 4 10.84202 3000.518 -4806.228 5.777897 0.0001224216
Scenario 2a means to ask the following question: what are the 5 predicted means of DURATION_l for each of the levels of CLASS and when SURPRISAL_s is controlled for in some way? The summary output shows only 1 of the 5, the one represented by the intercept in summary(m_02)$coefficients, i.e. when CLASS is closed and SURPRISAL_s is 0. If we want the means for all 5 levels of CLASS when, for instance, SURPRISAL_s is 0, with the present model we can compute them like this:
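coef(m_02)[1] + c(0, coef(m_02)[3:6])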
But the nd approach is sometimes more intuitive: here are the predicted means of DURATION_l for the levels of CLASS when the predictor SURPRISAL_s is 0 (the intercept), 1 (for the slope difference), and its mean (2.3708677):
nd <- expand.grid(
   SURPRISAL_s=c(0, 1, mean(d$SURPRISAL_s)), # main values of SURPRISAL_s
   CLASS=levels(d$CLASS)) %>%                # all levels of CLASS, then
   "["(,2:1)                                 # reorder columns
nd <- cbind(nd, predict(m_02, newdata=nd, interval="confidence"))
nd[nd$SURPRISAL_s==0,] # only part of nd, when SURP_s is 0
CLASS SURPRISAL_s fit lwr upr
1 closed 0 6.169798 6.097123 6.242474
4 number 0 7.759196 7.055597 8.462795
7 open 0 6.874254 6.744415 7.004092
10 propname 0 8.338038 7.592729 9.083347
13 unclear 0 8.740491 7.318164 10.162818
We can also get those from effect like this, but …
(cl_d <- data.frame(cl <- effect("CLASS", m_02)))
CLASS fit se lower upper
1 closed 7.100787 0.01180527 7.077645 7.123930
2 number 7.916657 0.07969515 7.760428 8.072886
3 open 7.760762 0.01820300 7.725078 7.796446
4 propname 8.465014 0.13552532 8.199338 8.730689
5 unclear 7.970505 0.14905516 7.678307 8.262703
… you can see we’re getting a warning because CLASS is in the interaction with SURPRISAL_s so we can’t just ignore it. Well, we can, kinda, but only because effect is smart enough to give us the predictions for the 5 levels of CLASS while SURPRISAL_s is held constant at its mean (which, now that we’re using the non-centered d$SURPRISAL_s, is not 0 but 2.3708677). But the better way to do this would be like this:
sucl_d <- data.frame(
   sucl <- effect("SURPRISAL_s:CLASS", m_02,
      xlevels=list(SURPRISAL_s=c(0:1, mean(d$SURPRISAL_s))))) %>%
   "["(,c(2,1,3,5,6))
# alternative:
# sucl_d <- data.frame(
#    sucl <- predictorEffect(
#       "SURPRISAL_s", m_02,
#       focal.levels=c(0:1, mean(d$SURPRISAL_s)))) %>% "["(,c(2,1,3,5,6))
sucl_d[sucl_d$SURPRISAL_s==0,] # only part of sucl_d, when SURP_s is 0
CLASS SURPRISAL_s fit lower upper
1 closed 0 6.169798 6.097123 6.242474
4 number 0 7.759196 7.055597 8.462795
7 open 0 6.874254 6.744415 7.004092
10 propname 0 8.338038 7.592729 9.083347
13 unclear 0 8.740491 7.318164 10.162818
1.2.5 Scen. 2b: means for num vs. 0
Scenario 2b means to ask whether each of the 5 means computed above is significantly different from 0 (and what the exact p-value is). This can so far not be answered fully from the summary output of the model. We can see from nd that none of the predicted means’ confidence intervals overlaps with 0, but we don’t have p-values. Right now, we know the answer to this question only for 1 of the 5 means, the one represented by the intercept in summary(m_02)$coefficients, i.e. when CLASS: closed and SURPRISAL_s is 0, which clearly is significantly different from 0. With your current knowledge, the ‘best’ way to proceed for you would be to generate m_02 4 more times, each time with CLASS re-leveled in such a way that the intercept comes to represent all 5 possible levels – clearly not a desirable option, which is why we will discuss a much better approach another time.
1.2.6 Scen. 2c: means for num vs. each other
Scenario 2c means to ask whether (some? all?) differences between the 5 means are significant. For that, we can use emmeans again:
(qwe <- pairs(emmeans(          # compute pairwise comparisons of means
   object=m_02,                 # for the model m_02
   specs=~ CLASS | SURPRISAL_s), # of CLASS when SURPRISAL_s is its mean
   adjust="none"))              # do not adjust for multiple comparisons
SURPRISAL_s = 2.37:
contrast estimate SE df t.ratio p.value
closed - number -0.8159 0.0806 6373 -10.127 <0.0001
closed - open -0.6600 0.0217 6373 -30.419 <0.0001
closed - propname -1.3642 0.1360 6373 -10.028 <0.0001
closed - unclear -0.8697 0.1500 6373 -5.817 <0.0001
number - open 0.1559 0.0817 6373 1.907 0.0566
number - propname -0.5484 0.1570 6373 -3.488 0.0005
number - unclear -0.0538 0.1690 6373 -0.319 0.7501
open - propname -0.7043 0.1370 6373 -5.150 <0.0001
open - unclear -0.2097 0.1500 6373 -1.397 0.1625
propname - unclear 0.4945 0.2010 6373 2.455 0.0141
Note, these results do not map onto the results from the summary table or the values from nd! From the summary table, we would expect the difference between CLASS closed and CLASS open to be this:
summary(m_02)$coefficients[4,"Estimate"]
[1] 0.7044554
Same from nd:
nd$fit[7] - nd$fit[1]
[1] 0.7044554
But that’s not what pairs(emmeans(...)) says:
data.frame(qwe)[2,-2]
contrast estimate SE df t.ratio p.value
2 closed - open -0.6599747 0.02169594 6373 -30.41927 6.723276e-190
That is because the summary of the model and the part of nd we looked at above return predictions when SURPRISAL_s is 0, whereas pairs(emmeans(...)) returns predictions when SURPRISAL_s is its mean, namely 2.3708677. Thus, these results would be identical if SURPRISAL had been centered, but they are not this time. Without centering, things fit only if we look at the ‘right’ part of nd, and that is of course why nd was created that way above:
nd$fit[3] - nd$fit[9]
[1] -0.6599747
data.frame(qwe)[2,]
contrast SURPRISAL_s estimate SE df t.ratio p.value
2 closed - open 2.370868 -0.6599747 0.02169594 6373 -30.41927 6.723276e-190
But can we get predictions/comparisons that match the summary output, i.e. for when SURPRISAL_s is 0, from emmeans as well? Yes, because emmeans can take an at argument:
pairs(emmeans(                  # compute pairwise comparisons of means
   object=m_02,                 # for the model m_02
   specs=~ CLASS | SURPRISAL_s, # of CLASS when SURPRISAL_s is ...
   at=list(SURPRISAL_s=0)),     # ... set to 0, not its mean!
   adjust="none")               # do not adjust for multiple comparisons
SURPRISAL_s = 0:
contrast estimate SE df t.ratio p.value
closed - number -1.589 0.3610 6373 -4.405 <0.0001
closed - open -0.704 0.0759 6373 -9.281 <0.0001
closed - propname -2.168 0.3820 6373 -5.676 <0.0001
closed - unclear -2.571 0.7260 6373 -3.538 0.0004
number - open 0.885 0.3650 6373 2.425 0.0154
number - propname -0.579 0.5230 6373 -1.107 0.2683
number - unclear -0.981 0.8090 6373 -1.212 0.2255
open - propname -1.464 0.3860 6373 -3.793 0.0002
open - unclear -1.866 0.7290 6373 -2.562 0.0104
propname - unclear -0.402 0.8190 6373 -0.491 0.6232
1.2.7 Scen. 3a: slopes for num
Let’s continue with m_02, because it has the categorical predictor CLASS, whose predicted means we just looked at, but it also has a numeric predictor SURPRISAL_s, and since there is a significant interaction between those two predictors, our model involves 5 different slopes of SURPRISAL_s, one for each level of CLASS, with some significant differences between them.
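For reference, here is the coefficients part of the summary again:

summary(m_02)$coefficients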
Scenario 3a means to ask the following question: what are the 5 predicted slopes of SURPRISAL_s for each of the levels of CLASS? We can compute those slopes manually from the coefficients of m_02:
the slope of SURPRISAL_s when CLASS is closed is listed in the summary table as the 2nd coefficient, the one called SURPRISAL_s, because that’s the slope of the predictor when CLASS is its first level: (coef(m_02)[2], i.e. 0.3926787);
the slope of SURPRISAL_s when CLASS is number is that 2nd coefficient plus the 7th coefficient, the one called SURPRISAL_s:CLASSnumber, i.e. the adjustment to the slope of SURPRISAL_s when CLASS is not closed but number: coef(m_02)[2] + coef(m_02)[7], i.e. 0.066415;
the slope of SURPRISAL_s when CLASS is open is that 2nd coefficient plus the 8th coefficient, the one called SURPRISAL_s:CLASSopen, i.e. the adjustment to the slope of SURPRISAL_s when CLASS is not closed but open: coef(m_02)[2] + coef(m_02)[8], i.e. 0.3739173;
the slope of SURPRISAL_s when CLASS is propname is that 2nd coefficient plus the 9th coefficient, the one called SURPRISAL_s:CLASSpropname, i.e. the adjustment to the slope of SURPRISAL_s when CLASS is not closed but propname: coef(m_02)[2] + coef(m_02)[9], i.e. 0.0535565;
the slope of SURPRISAL_s when CLASS is unclear is that 2nd coefficient plus the 10th coefficient, the one called SURPRISAL_s:CLASSunclear, i.e. the adjustment to the slope of SURPRISAL_s when CLASS is not closed but unclear: coef(m_02)[2] + coef(m_02)[10], i.e. -0.3247698.
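In one line, that manual arithmetic amounts to this (the 0 is for the reference level closed, whose slope is coef(m_02)[2] itself):

coef(m_02)[2] + c(0, coef(m_02)[7:10]) # all 5 slopes of SURPRISAL_s at once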
Thus, we could immediately get the slopes like this:
tapply(nd$fit,            # apply a grouping of the predictions
   list(nd$CLASS,         # by CLASS
        nd$SURPRISAL_s),  # and SURPRISAL_s
   c) %>%                 # just list the values,
   "["(,-3) %>%           # suppress the 3rd column
   apply(1, diff)         # compute the differences
closed number open propname unclear
0.3926787 0.0664150 0.3739173 0.0535565 -0.3247698
But the most practical solution might be the function emmeans::emtrends:
emtrends(             # compute the trends/slopes
   object=m_02,       # for the model m_02
   specs="CLASS",     # compute comparisons over this predictor
   var="SURPRISAL_s") # namely all slopes of this predictor
CLASS SURPRISAL_s.trend SE df lower.CL upper.CL
closed 0.3927 0.0170 6373 0.359 0.426
number 0.0664 0.1280 6373 -0.184 0.317
open 0.3739 0.0229 6373 0.329 0.419
propname 0.0536 0.1080 6373 -0.158 0.265
unclear -0.3248 0.2590 6373 -0.832 0.183
Confidence level used: 0.95
1.2.8 Scen. 3b: slopes for num vs. 0
Scenario 3b means to ask whether each of the 5 slopes just computed is significantly different from 0 (and what the exact p-value is). This can so far not be answered from the summary output of the model. Right now, we know the answer to this question only for 1 of the 5 slopes, the one represented by the 2nd coefficient (i.e., when CLASS: closed). With your current knowledge, the ‘best’ way to proceed for you would be to generate m_02 4 more times, each time with CLASS re-leveled in such a way that the 2nd coefficient comes to represent all 5 possible values – clearly not a desirable option, which is why we will discuss a much better approach another time.
1.2.9 Scen. 3c: slopes for num vs. each other
Scenario 3c means to ask whether (some? all?) differences between the 5 slopes are significant. Since we’re looking to compare slopes (a.k.a. trendlines), we use the function emmeans::emtrends but in conjunction with pairs, which will make comparisons of the predicted slopes of SURPRISAL_s between all combinations of the levels of CLASS:
pairs(emtrends(        # compute pairwise comparisons of trends/slopes
   object=m_02,        # for the model m_02
   specs="CLASS",      # compute comparisons over this predictor
   var="SURPRISAL_s"), # namely all slopes of this predictor
   adjust="none")      # do not adjust for multiple comparisons
contrast estimate SE df t.ratio p.value
closed - number 0.3263 0.1290 6373 2.533 0.0113
closed - open 0.0188 0.0285 6373 0.659 0.5099
closed - propname 0.3391 0.1090 6373 3.105 0.0019
closed - unclear 0.7174 0.2590 6373 2.765 0.0057
number - open -0.3075 0.1300 6373 -2.371 0.0178
number - propname 0.0129 0.1670 6373 0.077 0.9387
number - unclear 0.3912 0.2890 6373 1.355 0.1754
open - propname 0.3204 0.1100 6373 2.905 0.0037
open - unclear 0.6987 0.2600 6373 2.688 0.0072
propname - unclear 0.3783 0.2800 6373 1.349 0.1774
1.3 A priori (orthogonal) contrasts
1.3.1 Theoretical introduction
Let us turn to something really interesting now. All the above comparisons were concerned with fitting a model and then, once we have a model, figuring out what is significant and what is not – both by understanding the regular summary output and its coefficients and by making comparisons between predictions that the summary table does not readily show us (with emmeans/emtrends). But as I already mentioned in Section 1.1 above, there are even more questions one might ask, and everything we have done so far was done post-hoc, which is why we discussed correcting for multiple post-hoc tests. In some situations, however, one can set up the predictors in such a way that the summary table already answers multiple a priori hypotheses, especially a priori hypotheses that differ from the standard ones checked so far, and that’s what we’re discussing now.
d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)
d$DURATION_l <- log2(d$DURATION); d$SURPRISAL_s <- sqrt(d$SURPRISAL)
summary(d[,c(12,10)])
DURATION_l CLASS
Min. : 3.807 closed :4021
1st Qu.: 6.781 number : 125
Median : 7.340 open :2086
Mean : 7.348 propname: 110
3rd Qu.: 7.937 unclear : 41
Max. :10.456
Imagine for a moment we only want to fit a monofactorial model – just for simplicity’s sake, as we’re broaching this new topic – and you have the following hypotheses about the variable CLASS:
1. the mean of DURATION_l for closed and open combined will be different from the corresponding mean for all other levels combined (because, say, they are ‘normal words’ and the other levels are ‘more weird’);
2. the mean of DURATION_l for closed will be different from the corresponding mean for open (because, say, these are still functionally, frequency-wise, and distributionally very different classes);
3. the mean of DURATION_l for unclear will not be different from the corresponding mean for number and propname combined (because, say, unclear words could be anything);
4. the mean of DURATION_l for number will not be different from the corresponding mean for propname (even though these are obviously quite different classes).
It is possible to define a set of contrasts for the predictor CLASS that ‘embodies’ these 4 hypotheses so that the summary output will not show you what it usually does, but exactly the answers to these hypotheses. For that we first create a copy of CLASS, which we call CLASS_orth (for orthogonal, because these four hypotheses can be tested with 4 orthogonal contrasts):
levels(d$CLASS_orth <- d$CLASS)
[1] "closed" "number" "open" "propname" "unclear"
For a predictor with l levels (here, l=5), one can formulate l-1 orthogonal contrasts or contrast vectors – how lucky that we have exactly 4 … – and the formulation of each orthogonal contrast (vector) is governed by a set of rules we need to discuss:
1. levels that are ignored in a comparison are set to 0;
2. each level can only participate in one comparison in isolation;
3. the sum of all value tokens of a contrast vector is 0;
4. (groups of) levels that are contrasted with each other differ in their sign;
5. the sum of all absolute values of types is 1 (this one is really important; many other sources do not include this one, which makes regression summary tables harder to interpret than necessary, as I will demonstrate in a small excursus below);
6. the products of the weights of the levels sum to 0.
Without an example, this is probably close to incomprehensible, so let’s look at what this means for our four hypotheses/contrasts for CLASS_orth; since we will have to define this within R as a matrix, we will do so here as well. Check out Table 2 (which uses abbreviated level names for space reasons):
Table 2: Orthogonal contrast settings for our hypotheses

| CONTRAST / CLASS | closed | number | open | propname | unclear |
|---|---|---|---|---|---|
| 1: closed/open vs. other 3 | +0.6 | -0.4 | +0.6 | -0.4 | -0.4 |
| 2: closed vs. open | +0.5 | 0 | -0.5 | 0 | 0 |
| 3: number/propname vs. unclear | 0 | -0.3333 | 0 | -0.3333 | +0.6667 |
| 4: number vs. propname | 0 | +0.5 | 0 | -0.5 | 0 |
So, let’s walk through those rules separately for each of the hypotheses/contrasts:
with regard to rule 1,
contrast 1: no level is ignored in this contrast: all five levels of CLASS are involved, so none of them gets a 0, ✓;
contrast 2: this is only concerned with closed and open, which means all other levels get a 0, ✓;
contrast 3: this is only concerned with number, propname, and unclear, which means all other levels get a 0, ✓;
contrast 4: this is only concerned with number and propname, which means all other levels get a 0, ✓;
with regard to rule 2, each level participates in only one comparison in isolation:
closed is used in isolation only once, in contrast 2, ✓;
number is used in isolation only once, in contrast 4, ✓;
open is used in isolation only once, in contrast 2, ✓;
propname is used in isolation only once, in contrast 4, ✓;
unclear is used in isolation only once, in contrast 3, ✓;
with regard to rule 3, if we add up the 5 values in each row of the contrast table, we do get 0 for each row, ✓;
with regard to rule 4,
contrast 1: the two groups of values that are contrasted differ in their sign: closed/open have a positive contrast weight whereas the other three have a negative one, ✓;
contrast 2: the two values that are contrasted differ in their sign: closed has a positive contrast weight whereas open has a negative one, ✓;
contrast 3: the two groups of values that are contrasted differ in their sign: number/propname have positive contrast weights whereas unclear has a negative one, ✓;
contrast 4: the two values that are contrasted differ in their sign: number has a positive contrast weight whereas propname has a negative one, ✓;
with regard to rule 5,
contrast 1: if we take the two value types (+0.6 and -0.4), make them both positive, and add them up, we do get 1, ✓;
contrast 2: if we take the two value types (+0.5 and -0.5), make them both positive, and add them up, we do get 1, ✓;
contrast 3: if we take the two value types (+2/3 and -1/3), make them both positive, and add them up, we do get 1, ✓;
contrast 4: if we take the two value types (+0.5 and -0.5), make them both positive, and add them up, we do get 1, ✓;
with regard to rule 6, the sum of the products of each level’s weights is indeed 0: 0.6×0.5×0×0 + (-0.4)×0×(-0.3333)×(+0.5) + … = 0, ✓.
Now we tell R this by making each column of Table 2 a vector, which we combine into a matrix and assign as contrasts to CLASS_orth:
#                            clos  numb  open  prop  uncl
clo.op_vs_num.prop.unc <- c(+0.6, -0.4, +0.6, -0.4, -0.4)
clo_vs_op              <- c(+0.5,  0  , -0.5,  0  ,  0  )
num.prop_vs_uncl       <- c( 0  , -1/3,  0  , -1/3, +2/3)
num_vs_prop            <- c( 0  , +0.5,  0  , -0.5,  0  )
# & then we put those into a matrix like the above table & assign them as contrasts:
(contrasts(d$CLASS_orth) <-  # make the following the contrasts of CLASS_orth
   cbind(                    # the matrix resulting from binding together as columns
      clo.op_vs_num.prop.unc, # all
      clo_vs_op,              # four
      num.prop_vs_uncl,       # contrast
      num_vs_prop))           # vectors
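If you want to check rules 3 and 6 programmatically, here is a quick sketch: the column sums of the contrast matrix should all be 0 (rule 3), and all off-diagonal entries of its crossproduct, i.e. the sums of pairwise products of the contrast vectors (the standard orthogonality check), should be 0 as well:

cm <- contrasts(d$CLASS_orth) # the contrast matrix we just assigned
round(colSums(cm), 10)        # rule 3: each contrast vector sums to 0
round(crossprod(cm), 10)      # rule 6: all off-diagonal entries are 0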
Let’s now see what that does. We fit a first model with the default treatment contrasts of CLASS (m_03_treatm) and then a second one with our new orthogonal-contrasts predictor CLASS_orth (m_03_orthog):
summary(m_03_treatm <- lm(DURATION_l ~ 1 + CLASS, data=d))$coefficients %>% round(4)
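… and the analogous call for the model with the orthogonal contrasts (a sketch of the call; m_03_orthog is the model name used below):

summary(m_03_orthog <- lm(DURATION_l ~ 1 + CLASS_orth, data=d))$coefficients %>% round(4)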
To save space for presenting this, I am not showing the overall model statistics (the R2s, the F-test, etc.) but rest assured, they are all the same across the two models – all the new set of contrasts changes is the coefficients part. To recap, in m_03_treatm,
coefficient 1 (the intercept) is the predicted response when CLASS is closed;
coefficient 2 is what you add to the intercept to get the predicted response when CLASS changes from closed to number;
coefficient 3 is what you add to the intercept to get the predicted response when CLASS changes from closed to open;
coefficient 4 is what you add to the intercept to get the predicted response when CLASS changes from closed to propname;
coefficient 5 is what you add to the intercept to get the predicted response when CLASS changes from closed to unclear.
But in m_03_orthog, things are different. To understand how, let us first compute the predictions of both models, which are of course also identical (because, again, the only thing the contrasts are affecting, so to speak, is the coefficients part of the model summary).
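A minimal way to compute those predictions, assuming data frames named as they are referenced below (nd_treatm and nd_orthog, each with a column PREDS):

nd_treatm <- data.frame(CLASS=levels(d$CLASS))
nd_treatm$PREDS <- predict(m_03_treatm, newdata=nd_treatm)
nd_orthog <- data.frame(CLASS_orth=levels(d$CLASS_orth))
nd_orthog$PREDS <- predict(m_03_orthog, newdata=nd_orthog)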
coefficient 1 (called the intercept) is the unweighted mean of the 5 predictions for all levels of CLASS:
coef(m_03_orthog)[1]
(Intercept)
7.8455
mean(nd_treatm$PREDS) # mean(nd_orthog$PREDS)
[1] 7.8455
coefficient 2 is the difference of the unweighted mean of the 2 predictions for closed and open on the one hand and the unweighted mean of the 3 predictions for number and propname and unclear on the other, and the fact that it is significant says our first hypothesis is confirmed:
summary(m_03_orthog)$coefficients[2,]
Estimate Std. Error t value Pr(>|t|)
-6.424391e-01 5.029463e-02 -1.277351e+01 6.512723e-37
coefficient 3 is the difference of the prediction for closed and the prediction for open, and the fact that it is significant says our second hypothesis is confirmed:
summary(m_03_orthog)$coefficients[3,]
Estimate Std. Error t value Pr(>|t|)
-0.93901065 0.01960634 -47.89320883 0.00000000
coefficient 4 is the difference of the unweighted mean of the 2 predictions for number and propname on the one hand and the prediction for unclear on the other, and the fact that it is significant says our third hypothesis is not confirmed:
summary(m_03_orthog)$coefficients[4,]
Estimate Std. Error t value Pr(>|t|)
-0.393280161 0.123017647 -3.196941003 0.001395713
coefficient 5 is the difference of the prediction for number and the prediction for propname, and the fact that it is significant says our fourth hypothesis is not confirmed:
summary(m_03_orthog)$coefficients[5,]
Estimate Std. Error t value Pr(>|t|)
-5.807882e-01 9.499252e-02 -6.114042e+00 1.028827e-09
So yes, this comes with some work (of setting up the contrast matrix), but then it’s incredibly elegant: the summary output answers all your questions, or addresses all your hypotheses, and it does that even though the hypotheses involve multiple conflations etc. that would otherwise be very difficult to test. This way of defining orthogonal contrasts can streamline analyses a lot.
1.3.3 Excursus: what if you don’t follow rule 5?
Let’s quickly see why I personally find rule 5 so important, even if most other references I know don’t use it. Those other references, I think, try to make developing the contrast matrices simpler by using integers for the contrast vectors rather than the decimals/fractions that my rule 5 here implies. Therefore, they would define the contrast vectors not like I did above (repeated here for convenience) …
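A sketch of what that integer-based definition might look like; the exact integers for the four vectors are inferred from the coefficient ratios reported below:

d$CLASS_orth_notrule5 <- d$CLASS # a copy of CLASS for the integer-based contrasts
#                            clos numb open prop uncl
clo.op_vs_num.prop.unc <- c(  3,  -2,   3,  -2,  -2)
clo_vs_op              <- c(  1,   0,  -1,   0,   0)
num.prop_vs_uncl       <- c(  0,  -1,   0,  -1,   2)
num_vs_prop            <- c(  0,   1,   0,  -1,   0)
contrasts(d$CLASS_orth_notrule5) <- cbind(
   clo.op_vs_num.prop.unc, clo_vs_op, num.prop_vs_uncl, num_vs_prop)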
This sure looks easier (at first) but that ‘backfires’ – let me show you how. Here are the summary results of the model defined with my contrasts, with rule 5 in place:
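round(summary(m_03_orthog)$coefficients, 4)

… and, for comparison, of the model with the integer-based contrasts (a sketch; the model name m_03_notrule5 is an assumption):

round(summary( # a hypothetical name for the model without rule 5
   m_03_notrule5 <- lm(DURATION_l ~ 1 + CLASS_orth_notrule5, data=d)
   )$coefficients, 4)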
The reason for my preference becomes clear when you realize how all coefficients (except the intercept, the mean of the predicted means) differ between the two models. Here are the predictions of both models again:
nd_orthog
CLASS_orth PREDS
1 closed 6.990532
2 number 7.943175
3 open 7.929542
4 propname 8.523963
5 unclear 7.840289
We have seen above that, when you use my contrasts with rule 5, the coefficients in the summary table correspond to exactly the differences between the (means of grouped) means of contrasted levels. But that’s not the case anymore if you use the integer-based coding. Now
the coefficient CLASS_orth_notrule5clo.op_vs_num.prop.unc is -0.1284878, which is exactly 1/5 of the real difference, which we know is -0.6424391;
the coefficient CLASS_orth_notrule5clo_vs_op is -0.4695053, which is exactly 1/2 of the real difference, which we know is -0.9390107;
the coefficient CLASS_orth_notrule5num.prop_vs_uncl is -0.1310934, which is exactly 1/3 of the real difference, which we know is -0.3932802;
the coefficient CLASS_orth_notrule5num_vs_prop is -0.2903941, which is exactly 1/2 of the real difference, which we know is -0.5807882.
In other words, using the integers as we did reduces the coefficients by a factor of f, where f is the sum of the absolute values of the types of a contrast vector. Thus,
if you define clo.op_vs_num.prop.unc as c(3, -2, 3, -2, -2), then the sum of the absolute values of the types is 5 and we saw the coefficient for that contrast in the model not following rule 5 is indeed reduced to just 1/5th of the real difference, but
if you define clo.op_vs_num.prop.unc as c(0.6, -0.4, 0.6, -0.4, -0.4), then, as per rule 5, the sum of the absolute values of the types is 1 so the coefficient for that contrast in the model following rule 5 is indeed 1/1th of the real difference.
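If you do use the integers, you can of course recover the ‘real’ differences by scaling the coefficients back up; a sketch, using the hypothetical model name m_03_notrule5 from above:

f <- c(5, 2, 3, 2)          # the sums of the absolute value types per contrast
coef(m_03_notrule5)[-1] * f # recovers -0.6424, -0.9390, -0.3933, -0.5808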
Thus, yes, of course you can save mental effort and use the integers, but you then need to invest more mental effort, I think, to reconstruct the real differences from the summary table – better use rule 5 right away.
1.4 Exercises
1.4.1 Exercise 1
A corpus-based study of four varieties of English – a predictor called VARIETY with the 4 levels NigE, BrE, IndE, and AmE (in that order!) – set out to test the hypotheses that
NigE is significantly different from all others combined;
IndE is significantly different from the two inner-circle varieties AmE and BrE combined;
the two inner-circle varieties AmE and BrE are not significantly different.
Formulate the required contrast matrix for this predictor VARIETY.
A corpus-based study of 5 ‘varieties’ of English – a predictor called L1 with 5 levels EN (i.e. native speakers), FR, GE, NO, and SPA (in that order!) – set out to test the hypotheses that
the native speakers are significantly different from all others combined;
the two Germanic L1s NO and GE combined are significantly different from the two Romance L1s FR and SPA combined;
the two Germanic L1s NO and GE are not significantly different from each other;
the two Romance L1s FR and SPA are not significantly different from each other.
Formulate the required contrast matrix for this predictor L1.
---title: "Ling 204: session 02"author: - name: "[Stefan Th. Gries](https://www.stgries.info)" affiliation: - UC Santa Barbara - JLU Giessen orcid: 0000-0002-6497-3958date: "2026-01-14 12:34:56 PDT"date-format: "DD MMM YYYY HH-mm-ss"editor: sourceformat: html: page-layout: full code-fold: false code-link: true code-copy: true code-tools: true code-line-numbers: true code-overflow: scroll number-sections: true smooth-scroll: true toc: true toc-depth: 4 number-depth: 4 toc-location: left monofont: lucida console tbl-cap-location: top fig-cap-location: bottom fig-width: 4 fig-height: 4 fig-format: png fig-dpi: 300 fig-align: center embed-resources: true link-external-newwindow: trueexecute: cache: false echo: true eval: true warning: false---# Session 02: Coefficient comparisons 1 {#sec-s02}## Introduction: the 'incompleteness' of regression summaries {#sec-s02_incomplete}In this session, we will look at how to get more detail out of models and their summaries. The first part, @sec-s02_posthoc, is concerned with post-hoc comparisons while the second part, @sec-s02_apriori, deals with a priori motivated comparisons (which often are not even straightforwardly approachable from an existing model summary). But as an introduction, let's consider the issue of how much of what might be interesting a regression summary actually does (not) show, for which we briefly consider the following linear regression model on the [duration data](_input/wdurations.csv) using just two predictors:* `SURPRISAL_s`, a square-root transformed surprisal value;* `CLASS`: a categorical predictor categorizing words in terms of 5 different levels:```{r}rm(list=ls(all.names=TRUE)); pkgs2load <-c("car", "effects", "emmeans", "multcomp")suppressMessages(sapply(pkgs2load, library, character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)d <-read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)d$DURATION_l <-log2(d$DURATION); d$SURPRISAL_s <-sqrt(d$SURPRISAL); summary(d[,c(12:13,10)])m <-lm(DURATION_l ~1+ SURPRISAL_s*CLASS, data=d)```Consider all the questions that one might have with regard to just the interaction of `SURPRISAL_s:CLASS`. Once you think about it for a moment, I hope you realize that the questions one might have are potentially all of these:1. The slope of `SURPRISAL_s` when `CLASS` is *closed*, a. what is it? b. is it significantly different from 0? c. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *number*? d. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *open*? e. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *propname*? f. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *unclear*?2. The slope of `SURPRISAL_s` when `CLASS` is *number*, a. what is it? b. is it significantly different from 0? d. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *open*? e. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *propname*? f. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *unclear*?3. The slope of `SURPRISAL_s` when `CLASS` is *open*, a. what is it? b. is it significantly different from 0? c. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *propname*? d. is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *unclear*?4. The slope of `SURPRISAL_s` when `CLASS` is *propname*, a. what is it? b. is it significantly different from 0? c. 
is it significantly different from the slope of `SURPRISAL_s` when `CLASS` is *unclear*?5. The slope of `SURPRISAL_s` when `CLASS` is *unclear*, a. what is it? b. is it significantly different from 0?Again, these are just the most basic questions that would probably be of interest -- I am leaving out more specialized ones based on conflations and possibly interesting tests for specific values.But now consider again what most people provide, which, if we're lucky, is this:```{r}summary(m)$coef %>%round(4)plot(allEffects(m), grid=TRUE)```You see there's a big discrepancy. Which of the above questions does this output answer (directly or after one addition)?* when `CLASS` is *closed*, this answers all 6 questions (a to f);* when `CLASS` is *number*, this answers 1 out of 5 questions (a);* when `CLASS` is *open*, this answers 1 out of 4 questions (a);* when `CLASS` is *propname*, this answers 1 out of 3 questions (a);* when `CLASS` is *unclear*, this answers 1 out of 2 questions (a).Not a great result: we get 10 answers for minimally 20 questions and for many of the as yet unanswered questions you may not even know yet how to get them -- this session and the next will change this!We begin by re-loading and re-preparing the same [duration data](_input/wdurations.csv); this time we're centering the surprisal predictor:```{r}rm(list=ls(all.names=TRUE))d <-read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)d$DURATION_l <-log2(d$DURATION)d$SURPRISAL_sc <-sqrt(d$SURPRISAL) -mean(sqrt(d$SURPRISAL))```We fit a first model:```{r}(m_01 <-lm(data=d, formula=DURATION_l ~1+ SURPRISAL_sc + VOWELS + CLASS + SURPRISAL_sc:VOWELS + SURPRISAL_sc:CLASS + VOWELS:CLASS)) %>% { list("Summary"=summary(.),"Drop 1"=drop1(., test="F") %>% as.data.frame) }```In the interest of covering everything -- all 20 questions from above and more -- in excruciating detail, we'll go over nine kinds of scenarios (yes, nine, you can thank me later ...) which arise from the combination of two three-possibility scenarios:| | **a**: what are the means/slopes? | **b**: are they * diff from 0? | **c**: are they * diff from each other? ||---------------------------------------------------|-----------------------------------|--------------------------------|-----------------------------------------|| **1**: means for (combos of) levels of cat preds | scenario 1a | scenario 1b | scenario 1c || **2**: means for values of numeric preds | scenario 2a | scenario 2b | scenario 2c || **3**: slopes for (combos of) levels of cat preds | scenario 3a | scenario 3b | scenario 3c |: 9 scenarios relevant for post-hoc exploration of models {#tbl-threeTIMESthree}Yes, this seems annoying but each of the three questions in the columns has potentially important implications, and the way they are tackled depends on what one looks at: means, slopes, etc. And, maybe somewhat confusingly once you recognize it, none of these is fully answered by the summary output and hardly any study I see discusses all three.## Post-hoc comparisons {#sec-s02_posthoc}### Scen. 1a: means for catScenario 1a means to ask the following question: When `SURPRISAL_sc` is controlled for, what is each of the 10 predicted means of `DURATION_l` for when you cross the 2 levels of `VOWELS` and the 5 levels of `CLASS`? 
A simple way to answer all scenario 1 questions and, maybe more generally, to make sense of especially the coefficients of models that I often use is useful here as well: We create a data frame `nd` (for 'new data') that contains all variables in the rhs of the model formula and crosses* all levels of all categorical variables in the rhs (here `VOWELS` and `CLASS`);* the values of 0 and 1 for all numeric variables in the rhs and when the numeric predictors are not centered, I usually also include the mean value for each numeric variable:```{r}nd <-expand.grid( # generate all combinations ofSURPRISAL_sc=0:1, # main values of SURPRISAL_scCLASS=levels(d$CLASS), # all levels of CLASSVOWELS=levels(d$VOWELS)) %>%# all levels of VOWELS, then"["(,3:1) # reorder columns```To that data frame, we then add the predictions the final model makes for all these combinations and, when easily available, also their confidence intervals; we can then just look at the whole data frame or create different tabular versions of it -- here's one I find useful (because the rows let us see the predictions for `SURPRISAL_sc` 0 and 1 well, the difference of which is the slope of `SURPRISAL_sc`):```{r}nd <-cbind(nd, PREDS=predict( m_01, newdata=nd, interval="confidence"))colnames(nd)[4:6] <-c("fit", "lwr", "upr")ndtapply(FUN=c, X=nd$fit, # apply the function c to the predictionsINDEX=list( # which are grouped as follows:CLASS=nd$CLASS, # rowsSURPRISAL_sc=nd$SURPRISAL_sc, # columnsVOWELS=nd$VOWELS)) # slices```In this case, half of these are what we get from `effect`, which is one of the most efficient ways to answer 1a questions. This is because `effect` conveniently holds the predictors not included in the term at 'typical values', which for the centered (!) values of `SURPRISAL_sc` means the `effects` values are those in `nd` when `SURPRISAL_sc` is 0:```{r}(clvo_d <-data.frame( clvo <-effect("VOWELS:CLASS", m_01)))```### Scen. 1b: means for cat vs. 0Scenario 1b means to ask whether each of the 10 means computed above is significantly different from 0 (and what the exact *p*-value is). This can so far not be answered fully from the summary output of the model. We can see from `nd` that none of the predicted means' confidence intervals overlaps with 0, but we don't have exact *p*-values. Right now, we know the answer to this question only for 1 of the 10 means, the one represented by the intercept in `summary(m_01)$coefficients`: `CLASS`: *closed* with `VOWELS`: *one*, which clearly is significantly different from 0. With your current knowledge, the 'best' way to proceed for you might be to generate `m_01` 9 more times, each time with `CLASS` and `VOWELS` re-leveled in such ways that the intercept comes to represent all 10 possible combinations -- clearly not a desirable option, which is why we will discuss a much better approach another time.### Scen. 1c: means for cat vs. each otherScenario 1c means to ask whether (some? all?) differences between the 10 means are significant. For that, we can use `emmeans::emmeans` and `pairs`. 
For a 2-way interaction like this, there are three options -- two of them (options a and b) are usually/often pretty good/useful and justifiable, the last one (option c) often not so much.Option a here is to make comparisons of the predicted means for the levels of `CLASS` within the levels of `VOWELS`:```{r}pairs(emmeans( # compute pairwise comparisons of meansobject=m_01, # for the model m_01specs=~ CLASS | VOWELS), # of CLASS within each level of VOWELS# you could also write this like this:# specs="CLASS", # of CLASS# by="VOWELS"), # within each level of VOWELSadjust="none") # do not adjust for multiple comparisons```You can see that these results nicely map onto our data frame `nd`; here are the first two:```{r}round(nd$fit[1] - nd$fit[3], 4)round(nd$fit[1] - nd$fit[5], 4)```But note that we should probably adjust for multiple comparisons because the high number of post-hoc comparisons we're making increases the probability of false positives. The [Holm method](https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method) is a useful and not too conservative way to do that:```{r}pairs(emmeans( # compute pairwise comparisons of meansobject=m_01, # for the model m_01specs=~ CLASS | VOWELS), # of CLASS within each level of VOWELSadjust="holm") # adjust for multiple comparisons w/ Holm's method```For the most part, I will continue to *not* adjust for multiple tests, but that's also only for didactic reasons, namely so you can make comparisons to the models' summaries -- in real applications, adjusting for multiple tests is often useful if not required.Option b here is the reverse perspective from the last one: to make comparisons of the predicted means for the levels of `VOWELS` within the levels of `CLASS`:```{r}pairs(emmeans( # compute pairwise comparisons of meansobject=m_01, # for the model m_01specs=~ VOWELS | CLASS), # of VOWELS within each level of CLASS# you could also write this like this:# specs="VOWELS", # of VOWELS# by="CLASS"), # within each level of CLASSadjust="none") # do not adjust for multiple comparisons```Again, this maps perfectly onto what we'd see in `nd`; here are the first two:```{r}round(nd$fit[1] - nd$fit[11], 4)round(nd$fit[3] - nd$fit[13], 4)```Option c, the not recommended one, is to make all possible comparisons across all combinations of levels -- fishing for results by considering even comparisons no one would ever postulate a priori; for that extreme fishing, adjusting for multiple comparisons is pretty much obligatory.```{r}pairs(emmeans( # compute pairwise comparisons of meansobject=m_01, # for the model m_01specs=~ VOWELS:CLASS), # of VOWELS:CLASSadjust="holm") # adjust for multiple comparisons w/ Holm's method```Now, what if you had only been interested in a subset of the comparisons that `pairs(emmeans(...))` makes? If you were interested in only, say, 5 comparisons, you don't want `pairs` to adjust for 10. In such cases, you can extract the 5 non-adjusted *p*-values you're interested in and correct them yourself only for the 5 tests you're making. 
Now, what if you had only been interested in a subset of the comparisons that `pairs(emmeans(...))` makes? If you were interested in only, say, 5 comparisons, you don't want `pairs` to adjust for 10. In such cases, you can extract the 5 non-adjusted *p*-values you're interested in and correct them yourself only for the 5 tests you're making. Imagine you were only interested in the comparisons 11 to 15 shown in the following output:

```{r}
(qwe <- pairs(emmeans(        # compute pairwise comparisons of means
   object=m_01,               # for the model m_01
   specs=~ CLASS | VOWELS),   # of CLASS within each level of VOWELS
   adjust="none"))            # do not adjust for multiple comparisons
```

Then this is how you would proceed, because the function `p.adjust` determines how many tests to adjust for from the length of its `p` argument, here 5:

```{r}
p.adjust(p=data.frame(qwe)[11:15,7], method="holm")
```

### Scen. 2a: means for num

To deal with the 3 scenarios 2a, 2b, and 2c, we now look at a new model `m_02`, which is slightly simpler (we leave out `VOWELS`) and fit slightly differently; importantly, note that `SURPRISAL` is now square-root transformed, but not centered:

```{r}
d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)
d$DURATION_l <- log2(d$DURATION)   # we log DURATION again
d$SURPRISAL_s <- sqrt(d$SURPRISAL) # SURPRISAL is now not centered
(m_02 <- lm(data=d, formula=DURATION_l ~ 1 + SURPRISAL_s*CLASS)) %>% {
   list("Summary"=summary(.),
        "Drop 1"=drop1(., test="F") %>% as.data.frame) }
```

Scenario 2a means to ask the following question: what are the 5 predicted means of `DURATION_l` for each of the levels of `CLASS` when `SURPRISAL_s` is controlled for in some way? The summary output shows only 1 of the 5, the one represented by the intercept in `summary(m_02)$coefficients`, i.e. when `CLASS` is *closed* and `SURPRISAL_s` is 0. If we want the means for all 5 levels of `CLASS`, for instance when `SURPRISAL_s` is 0, with the present model we can compute them like this:

```{r}
coef(m_02)[1] + c(0, coef(m_02)[3:6])
```

But the `nd` approach is sometimes more intuitive. Here are the predicted means of `DURATION_l` for the levels of `CLASS` when the predictor `SURPRISAL_s` is 0 (the intercept), 1 (for the slope difference), and its mean (`r mean(d$SURPRISAL_s)`):

```{r}
nd <- expand.grid(
   SURPRISAL_s=c(0, 1, mean(d$SURPRISAL_s)), # main values of SURPRISAL_s
   CLASS=levels(d$CLASS)) %>%                # all levels of CLASS, then
   "["(,2:1)                                 # reorder columns
nd <- cbind(nd, predict(
   m_02, newdata=nd, interval="confidence"))
nd[nd$SURPRISAL_s==0,] # only part of nd, when SURP_s is 0
```

We can also get those from `effect` like this, but ...

```{r}
(cl_d <- data.frame(
   cl <- effect("CLASS", m_02)))
```

... you can see we're getting a warning because `CLASS` is in an interaction with `SURPRISAL_s`, so we can't just ignore it. Well, we can, kinda, but only because `effect` is smart enough to give us the predictions for the 5 levels of `CLASS` while `SURPRISAL_s` is held constant at its mean (which now, given that we are using the non-centered `d$SURPRISAL_s`, is not 0 but `r mean(d$SURPRISAL_s)`). But the better way to do this would be like this:

```{r}
sucl_d <- data.frame(
   sucl <- effect("SURPRISAL_s:CLASS", m_02,
      xlevels=list(SURPRISAL_s=c(0:1, mean(d$SURPRISAL_s))))) %>%
   "["(,c(2,1,3,5,6))
# alternative:
# sucl_d <- data.frame(
#    sucl <- predictorEffect(
#       "SURPRISAL_s", m_02,
#       focal.levels=c(0:1, mean(d$SURPRISAL_s)))) %>% "["(,c(2,1,3,5,6))
sucl_d[sucl_d$SURPRISAL_s==0,] # only the part when SURP_s is 0
```
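As a quick sanity check of what `effect` just did -- this is just a sketch of the arithmetic -- the prediction for the reference level *closed* at the mean of `SURPRISAL_s` is simply the intercept plus the `SURPRISAL_s` slope times that mean (all other coefficients are multiplied by 0 for *closed*):

```{r}
# manual check: the prediction for CLASS: closed at the mean of SURPRISAL_s
coef(m_02)[1] + coef(m_02)[2]*mean(d$SURPRISAL_s)
```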
### Scen. 2b: means for num vs. 0

Scenario 2b means to ask whether each of the 5 means computed above is significantly different from 0 (and what the exact *p*-value is). So far, this cannot be answered fully from the summary output of the model. We can see from `nd` that none of the predicted means' confidence intervals overlaps with 0, but we don't have exact *p*-values. Right now, we know the answer to this question only for 1 of the 5 means, the one represented by the intercept in `summary(m_02)$coefficients`, i.e. when `CLASS` is *closed* and `SURPRISAL_s` is 0, which clearly is significantly different from 0. With your current knowledge, the 'best' way to proceed for you would be to generate `m_02` 4 more times, each time with `CLASS` re-leveled in such a way that the intercept comes to represent all 5 possible levels -- clearly not a desirable option, which is why we will discuss a much better approach another time.

### Scen. 2c: means for num vs. each other

Scenario 2c means to ask whether (some? all?) differences between the 5 means are significant. For that, we can use `emmeans` again:

```{r}
(qwe <- pairs(emmeans(           # compute pairwise comparisons of means
   object=m_02,                  # for the model m_02
   specs=~ CLASS | SURPRISAL_s), # of CLASS when SURPRISAL_s is its mean
   adjust="none"))               # do not adjust for multiple comparisons
```

Note, these results do not map onto the results from the summary table or the values from `nd`! From the summary table, we would expect the difference between `CLASS` *closed* and `CLASS` *open* to be this:

```{r}
summary(m_02)$coefficients[4,"Estimate"]
```

Same from `nd`:

```{r}
nd$fit[7] - nd$fit[1]
```

But that's not what `pairs(emmeans(...))` says:

```{r}
data.frame(qwe)[2,-2]
```

That is because the summary of the model and the part of `nd` we looked at above return predictions when `SURPRISAL_s` is 0, whereas `pairs(emmeans(...))` returns predictions when `SURPRISAL_s` is its mean, namely `r mean(d$SURPRISAL_s)`. Thus, these results *would* be identical if `SURPRISAL` had been centered, but are not this time. Without centering, things fit only if we look at the 'right' part of `nd`, and that is of course why `nd` was created the way it was above:

```{r}
nd$fit[3] - nd$fit[9]
data.frame(qwe)[2,]
```

But can we get predictions/comparisons that match the summary output, i.e. for when `SURPRISAL_s` is 0, from `emmeans` as well? Yes, because `emmeans` can take an `at` argument:

```{r}
pairs(emmeans(                   # compute pairwise comparisons of means
   object=m_02,                  # for the model m_02
   specs=~ CLASS | SURPRISAL_s,  # of CLASS when SURPRISAL_s is ...
   at=list(SURPRISAL_s=0)),      # set to 0, not its mean!
   adjust="none")                # do not adjust for multiple comparisons
```
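Alternatively, just to confirm the point about centering with a sketch (the variable name `SURPRISAL_sc2` is made up here so as not to clash with anything above): if we re-fit `m_02` with a centered version of `SURPRISAL_s`, the `CLASSopen` coefficient of that re-fit represents the *open* vs. *closed* difference at the mean of the predictor and thus matches `pairs(emmeans(...))`:

```{r}
# a sketch: with a centered predictor, the summary table and
# pairs(emmeans(...)) at the (now 0) mean line up
d$SURPRISAL_sc2 <- d$SURPRISAL_s - mean(d$SURPRISAL_s) # hypothetical name
m_02_cent <- lm(DURATION_l ~ 1 + SURPRISAL_sc2*CLASS, data=d)
summary(m_02_cent)$coefficients[4,"Estimate"] # open vs. closed at the mean,
   # i.e. the negative of the closed-open comparison from qwe above
```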
### Scen. 3a: slopes for num

Let's continue with `m_02`: it has the categorical predictor `CLASS`, whose predicted means we just looked at, but it also has a numeric predictor `SURPRISAL_s`, and since there is a significant interaction between those two predictors, our model involves 5 different slopes of `SURPRISAL_s`, one for each level of `CLASS`, with some significant differences between them.

```{r}
summary(m_02)$coefficients
```

Scenario 3a means to ask the following question: what are the 5 predicted slopes of `SURPRISAL_s` for each of the levels of `CLASS`? We can compute those slopes manually from the coefficients of `m_02`:

* the slope of `SURPRISAL_s` when `CLASS` is *closed* is listed in the summary table as the 2nd coefficient, the one called `SURPRISAL_s`, because that's the slope of the predictor when `CLASS` is its first level: `coef(m_02)[2]`, i.e. `r coef(m_02)[2]`;
* the slope of `SURPRISAL_s` when `CLASS` is *number* is that 2nd coefficient plus the 7th coefficient, the one called `SURPRISAL_s:CLASSnumber`, i.e. the adjustment to the slope of `SURPRISAL_s` when `CLASS` is not *closed* but *number*: `coef(m_02)[2] + coef(m_02)[7]`, i.e. `r coef(m_02)[2] + coef(m_02)[7]`;
* the slope of `SURPRISAL_s` when `CLASS` is *open* is that 2nd coefficient plus the 8th coefficient, the one called `SURPRISAL_s:CLASSopen`, i.e. the adjustment to the slope of `SURPRISAL_s` when `CLASS` is not *closed* but *open*: `coef(m_02)[2] + coef(m_02)[8]`, i.e. `r coef(m_02)[2] + coef(m_02)[8]`;
* the slope of `SURPRISAL_s` when `CLASS` is *propname* is that 2nd coefficient plus the 9th coefficient, the one called `SURPRISAL_s:CLASSpropname`, i.e. the adjustment to the slope of `SURPRISAL_s` when `CLASS` is not *closed* but *propname*: `coef(m_02)[2] + coef(m_02)[9]`, i.e. `r coef(m_02)[2] + coef(m_02)[9]`;
* the slope of `SURPRISAL_s` when `CLASS` is *unclear* is that 2nd coefficient plus the 10th coefficient, the one called `SURPRISAL_s:CLASSunclear`, i.e. the adjustment to the slope of `SURPRISAL_s` when `CLASS` is not *closed* but *unclear*: `coef(m_02)[2] + coef(m_02)[10]`, i.e. `r coef(m_02)[2] + coef(m_02)[10]`.

Or, more concisely:

```{r}
coef(m_02)[2] + c(0, coef(m_02)[7:10])
```

But why not use `nd` instead? This is what it looked like:

```{r}
nd
```

Thus, we could immediately get the slopes like this:

```{r}
tapply(nd$fit,            # apply a grouping of the predictions
   list(nd$CLASS,         # by CLASS
        nd$SURPRISAL_s),  # and SURPRISAL_s
   c) %>%                 # just list the values,
   "["(,-3) %>%           # suppress the 3rd column
   apply(1, diff)         # compute the differences
```

Or, you could do the same like this:

```{r}
nd.split <- split(nd, nd$SURPRISAL_s)
nd.split$`1`$fit - nd.split$`0`$fit
```

But the most practical solution might be the function `emmeans::emtrends`:

```{r}
emtrends(                 # compute the trends/slopes
   object=m_02,           # for the model m_02
   specs="CLASS",         # one slope for each level of this predictor
   var="SURPRISAL_s")     # namely all slopes of this predictor
```

### Scen. 3b: slopes for num vs. 0

Scenario 3b means to ask whether each of the 5 slopes just computed is significantly different from 0 (and what the exact *p*-value is). This cannot so far be answered from the summary output of the model. Right now, we know the answer to this question only for 1 of the 5 slopes, the one represented by the 2nd coefficient (i.e., when `CLASS` is *closed*). With your current knowledge, the 'best' way to proceed for you would be to generate `m_02` 4 more times, each time with `CLASS` re-leveled in such a way that the 2nd coefficient comes to represent all 5 possible slopes -- clearly not a desirable option, which is why we will discuss a much better approach another time.

### Scen. 3c: slopes for num vs. each other

Scenario 3c means to ask whether (some? all?) differences between the 5 slopes are significant. Since we're looking to compare slopes (a.k.a. trendlines), we use the function `emmeans::emtrends`, but in conjunction with `pairs`, which will make comparisons of the predicted slopes of `SURPRISAL_s` between all combinations of the levels of `CLASS`:

```{r}
pairs(emtrends(           # compute pairwise comparisons of trends/slopes
   object=m_02,           # for the model m_02
   specs="CLASS",         # compute comparisons over this predictor
   var="SURPRISAL_s"),    # namely all slopes of this predictor
   adjust="none")         # do not adjust for multiple comparisons
```
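As a quick cross-check of these comparisons -- again just a sketch of the arithmetic -- note that, with treatment contrasts, every comparison involving *closed* is simply the sign-flipped interaction coefficient from the summary table:

```{r}
# sanity check: slope(closed)-slope(number) and slope(closed)-slope(open)
# are the negatives of the corresponding interaction coefficients
-coef(m_02)[7] # closed - number
-coef(m_02)[8] # closed - open
```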
## A priori (orthogonal) contrasts {#sec-s02_apriori}

### Theoretical introduction

Let us turn to something really interesting now. All the above comparisons were concerned with fitting a model and then, once we had it, figuring out what is significant and what is not -- both by understanding the regular `summary` output and its coefficients and by making comparisons between predictions that the `summary` table does not readily show us (with `emmeans`/`emtrends`). But as I already mentioned in @sec-s02_incomplete above, there are actually even more questions one might ask, and everything we have done so far was done post-hoc, which is why we discussed correcting for multiple post-hoc tests. In some situations, however, one can set up the predictors in such a way that the `summary` table already answers multiple a priori hypotheses, and especially a priori hypotheses that differ from the standard ones checked so far -- that's what we're discussing now.

We're reloading the duration data:

```{r}
rm(list=ls(all.names=TRUE)); pkgs2load <- c("car", "effects", "emmeans", "multcomp")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE)
d$DURATION_l <- log2(d$DURATION); d$SURPRISAL_s <- sqrt(d$SURPRISAL)
summary(d[,c(12,10)])
```

Imagine for a moment we only want to fit a monofactorial model -- just for simplicity's sake, as we're broaching this new topic -- and you have the following hypotheses about the variable `CLASS`:

1. the mean of `DURATION_l` for *closed* and *open* combined will be different from the corresponding mean for all other levels combined (because, say, they are 'normal words' and the other levels are 'more weird');
2. the mean of `DURATION_l` for *closed* will be different from the corresponding mean for *open* (because, say, these are still functionally, frequency-wise, and distributionally very different classes);
3. the mean of `DURATION_l` for *unclear* will not be different from the corresponding mean for *number* and *propname* combined (because, say, unclear words could be anything);
4. the mean of `DURATION_l` for *number* will not be different from the corresponding mean for *propname* (even though these are obviously quite different classes).

It is possible to define a set of contrasts for the predictor `CLASS` that 'embodies' these 4 hypotheses so that the summary output will not show you what it usually does, but exactly the answers to these hypotheses. For that, we first create a copy of `CLASS`, which we call `CLASS_orth` (for *orthogonal*, because these four hypotheses can be tested with 4 orthogonal contrasts):

```{r}
levels(d$CLASS_orth <- d$CLASS)
```

For a predictor with *l* levels (here, *l*=5), one can formulate *l*-1 orthogonal contrasts or contrast vectors -- how lucky we have exactly 4 ... -- and the formulation of each orthogonal contrast (vector) is governed by a set of rules we need to discuss:

1. levels that are ignored in a comparison are set to 0;
2. each level can participate in only one comparison in isolation;
3. the sum of all value tokens of a contrast vector is 0;
4. (groups of) levels that are contrasted with each other differ in their sign;
5. the sum of the absolute values of the value types of a contrast vector is 1 (this one is really important; many other sources do not include it, which makes their regression summary tables harder to interpret than necessary, as I will demonstrate in a small excursus below);
6. for each pair of contrast vectors, the products of the corresponding weights sum to 0 (this is what makes the contrasts *orthogonal*).
Without an example, this is probably close to incomprehensible, so let's look at what this means for our four hypotheses/contrasts for `CLASS_orth`; since we will have to define this within R as a matrix, we will do so here as well. Check out @tbl-orthcontr (which uses abbreviated level names for space reasons):

| CONTRAST / `CLASS`                   | *closed* | *number* | *open* | *propname* | *unclear* |
|:-------------------------------------|:---------|:---------|:-------|:-----------|:----------|
| 1: *closed*/*open* vs. other 3       | +0.6     | -0.4     | +0.6   | -0.4       | -0.4      |
| 2: *closed* vs. *open*               | +0.5     | 0        | -0.5   | 0          | 0         |
| 3: *number*/*propname* vs. *unclear* | 0        | -0.3333  | 0      | -0.3333    | +0.6667   |
| 4: *number* vs. *propname*           | 0        | 0.5      | 0      | -0.5       | 0         |

: Orthogonal contrast settings for our hypotheses {#tbl-orthcontr}

So, let's walk through those rules separately for each of the hypotheses/contrasts:

1. with regard to rule 1,
   a. contrast 1: no level is ignored in this contrast: all five levels of `CLASS` are involved, so none of them gets a 0, ✓;
   b. contrast 2: this is only concerned with *closed* and *open*, which means all other levels get a 0, ✓;
   c. contrast 3: this is only concerned with *number*, *propname*, and *unclear*, which means all other levels get a 0, ✓;
   d. contrast 4: this is only concerned with *number* and *propname*, which means all other levels get a 0, ✓;
2. with regard to rule 2, each level participates in only one comparison in isolation:
   a. *closed* is used in isolation only once, in contrast 2, ✓;
   b. *number* is used in isolation only once, in contrast 4, ✓;
   c. *open* is used in isolation only once, in contrast 2, ✓;
   d. *propname* is used in isolation only once, in contrast 4, ✓;
   e. *unclear* is used in isolation only once, in contrast 3, ✓;
3. with regard to rule 3, if we add up the 5 values in each row of the contrast table, we do get 0 for each row, ✓;
4. with regard to rule 4,
   a. contrast 1: the two groups of values that are contrasted differ in their sign: *closed*/*open* have a positive contrast weight whereas the other three have a negative one, ✓;
   b. contrast 2: the two values that are contrasted differ in their sign: *closed* has a positive contrast weight whereas *open* has a negative one, ✓;
   c. contrast 3: the two groups of values that are contrasted differ in their sign: *number*/*propname* have negative contrast weights whereas *unclear* has a positive one, ✓;
   d. contrast 4: the two values that are contrasted differ in their sign: *number* has a positive contrast weight whereas *propname* has a negative one, ✓;
5. with regard to rule 5,
   a. contrast 1: if we take the two value types (+0.6 and -0.4), make them both positive, and add them up, we do get 1, ✓;
   b. contrast 2: if we take the two value types (+0.5 and -0.5), make them both positive, and add them up, we do get 1, ✓;
   c. contrast 3: if we take the two value types (+^2^/~3~ and -^1^/~3~), make them both positive, and add them up, we do get 1, ✓;
   d. contrast 4: if we take the two value types (+0.5 and -0.5), make them both positive, and add them up, we do get 1, ✓;
6. with regard to rule 6, the products of corresponding weights sum to 0 for every pair of contrast vectors; e.g., for contrasts 1 and 2: 0.6×0.5 + (-0.4)×0 + 0.6×(-0.5) + (-0.4)×0 + (-0.4)×0 = 0, ✓, and the same holds for the five other pairs (see the check below).

Now we tell R this by turning each row of @tbl-orthcontr (i.e., each contrast) into a vector, which we combine column-wise into a matrix and assign as contrasts to `CLASS_orth`:

```{r}
#                            clos  numb  open  prop  uncl
clo.op_vs_num.prop.unc <- c(+0.6, -0.4, +0.6, -0.4, -0.4)
clo_vs_op              <- c(+0.5,  0  , -0.5,  0  ,  0  )
num.prop_vs_uncl       <- c( 0  , -1/3,  0  , -1/3, +2/3)
num_vs_prop            <- c( 0  , +0.5,  0  , -0.5,  0  )
# & then we put those into a matrix like the above table & assign them as contrasts:
(contrasts(d$CLASS_orth) <-   # make the following the contrasts of CLASS_orth
   cbind(                     # the matrix resulting from binding together as columns
      clo.op_vs_num.prop.unc, # all
      clo_vs_op,              # four
      num.prop_vs_uncl,       # contrast
      num_vs_prop))           # vectors
```
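A quick way to verify rule 6 programmatically -- just a sketch using base R's `crossprod` -- is to compute all pairwise products of the contrast columns at once; every off-diagonal cell must be 0:

```{r}
# pairwise dot products of the contrast columns; all off-diagonal
# cells must be 0 for the contrasts to be orthogonal
round(crossprod(contrasts(d$CLASS_orth)), 10)
```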
### Implications for model coefficients

Let's now see what that does. We fit a first model with the default treatment contrasts of `CLASS` (`m_03_treatm`) and then a second one with our new orthogonal-contrasts predictor `CLASS_orth` (`m_03_orthog`):

```{r}
summary(m_03_treatm <- lm(DURATION_l ~ 1 + CLASS     , data=d))$coefficients %>% round(4)
summary(m_03_orthog <- lm(DURATION_l ~ 1 + CLASS_orth, data=d))$coefficients %>% round(4)
```

To save space, I am not showing the overall model statistics (the *R*^2^s, the *F*-test, etc.) but rest assured, they are all the same across the two models -- all the new set of contrasts changes is the coefficients part. To recap, in `m_03_treatm`,

* coefficient 1 (the intercept) is the predicted response when `CLASS` is *closed*;
* coefficient 2 is what you add to the intercept to get the predicted response when `CLASS` changes from *closed* to *number*;
* coefficient 3 is what you add to the intercept to get the predicted response when `CLASS` changes from *closed* to *open*;
* coefficient 4 is what you add to the intercept to get the predicted response when `CLASS` changes from *closed* to *propname*;
* coefficient 5 is what you add to the intercept to get the predicted response when `CLASS` changes from *closed* to *unclear*.
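To make that recap concrete, here is a one-line sanity check (a sketch): adding, say, the `CLASSopen` coefficient to the intercept should return the predicted mean for *open*:

```{r}
# quick check: intercept + CLASSopen coefficient = predicted mean for open
coef(m_03_treatm)[1] + coef(m_03_treatm)["CLASSopen"]
```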
But in `m_03_orthog`, things are different. To understand how, let us first compute the predictions of both models, which are of course identical (because, again, the only thing the contrasts affect, so to speak, is the coefficients part of the model summary).

```{r}
nd_treatm <- data.frame(CLASS=levels(d$CLASS))
(nd_treatm <- cbind(nd_treatm, PREDS=predict(m_03_treatm, newdata=nd_treatm)))
nd_orthog <- data.frame(CLASS_orth=levels(d$CLASS_orth))
(nd_orthog <- cbind(nd_orthog, PREDS=predict(m_03_orthog, newdata=nd_orthog)))
```

Now, what do the coefficients of `m_03_orthog` mean?

```{r}
coef(m_03_orthog)
```

* coefficient 1 (called the intercept) is the unweighted mean of the 5 predictions for all levels of `CLASS`:

```{r}
coef(m_03_orthog)[1]
mean(nd_treatm$PREDS) # mean(nd_orthog$PREDS)
```

* coefficient 2 is the difference between the unweighted mean of the 2 predictions for *closed* and *open* on the one hand and the unweighted mean of the 3 predictions for *number*, *propname*, and *unclear* on the other, and the fact that it is significant says our first hypothesis is confirmed:

```{r}
summary(m_03_orthog)$coefficients[2,]
mean(nd_orthog$PREDS[c(1,3)]) - mean(nd_orthog$PREDS[c(2,4,5)])
```

* coefficient 3 is the difference between the prediction for *closed* and the prediction for *open*, and the fact that it is significant says our second hypothesis is confirmed:

```{r}
summary(m_03_orthog)$coefficients[3,]
mean(nd_orthog$PREDS[1]) - mean(nd_orthog$PREDS[3])
```

* coefficient 4 is the difference between the prediction for *unclear* on the one hand and the unweighted mean of the 2 predictions for *number* and *propname* on the other, and the fact that it is significant says our third hypothesis is not confirmed:

```{r}
summary(m_03_orthog)$coefficients[4,]
mean(nd_orthog$PREDS[5]) - mean(nd_orthog$PREDS[c(2,4)])
```

* coefficient 5 is the difference between the prediction for *number* and the prediction for *propname*, and the fact that it is significant says our fourth hypothesis is not confirmed:

```{r}
summary(m_03_orthog)$coefficients[5,]
mean(nd_orthog$PREDS[2]) - mean(nd_orthog$PREDS[4])
```

So yes, this comes with some work (of setting up the contrast matrix), but then it's incredibly elegant: the `summary` output answers all your questions, or addresses all your hypotheses, even though the hypotheses involve multiple conflations etc. that would otherwise be difficult to test. This way of defining orthogonal contrasts can streamline analyses a lot.
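Incidentally, if you want the same four comparisons without re-coding the predictor, `emmeans::contrast` also accepts custom contrast weights as a named list -- here is a sketch that applies our four contrasts to the predicted means of the treatment-coded model (the labels are mine):

```{r}
# the same four a priori comparisons via emmeans' custom contrasts;
# the weights encode differences between (means of) group means
contrast(emmeans(m_03_treatm, specs=~ CLASS), method=list(
   "clo/op vs num/prop/unc"=c(1/2, -1/3, 1/2, -1/3, -1/3),
   "clo vs op"             =c(1  ,  0  , -1 ,  0  ,  0  ),
   "unc vs num/prop"       =c(0  , -1/2,  0 , -1/2,  1  ),
   "num vs prop"           =c(0  ,  1  ,  0 , -1  ,  0  )))
```

The estimates (and their tests) should match coefficients 2 to 5 of `m_03_orthog`.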
### Excursus: what if you don't follow rule 5?

Let's quickly see why I personally find rule 5 so important, even if most other references I know don't use it. Those other references, I think, try to make developing the contrast matrices simpler by using integers for the contrast vectors rather than the decimals/fractions that my rule 5 here implies. Therefore, they would define the contrast vectors not like I did above (repeated here for convenience) ...

```{r}
#| eval: false
#                            clos  numb  open  prop  uncl
clo.op_vs_num.prop.unc <- c(+0.6, -0.4, +0.6, -0.4, -0.4)
clo_vs_op              <- c(+0.5,  0  , -0.5,  0  ,  0  )
num.prop_vs_uncl       <- c( 0  , -1/3,  0  , -1/3, +2/3)
num_vs_prop            <- c( 0  , +0.5,  0  , -0.5,  0  )
```

but like this:

```{r}
d$CLASS_orth_notrule5 <- d$CLASS
#                            clos  numb  open  prop  uncl
clo.op_vs_num.prop.unc <- c( +3 ,  -2 ,  +3 ,  -2 ,  -2)
clo_vs_op              <- c( +1 ,   0 ,  -1 ,   0 ,   0)
num.prop_vs_uncl       <- c(  0 ,  -1 ,   0 ,  -1 ,  +2)
num_vs_prop            <- c(  0 ,  +1 ,   0 ,  -1 ,   0)
(contrasts(d$CLASS_orth_notrule5) <- # make the following the contrasts of CLASS_orth_notrule5
   cbind(                     # the matrix resulting from binding together as columns
      clo.op_vs_num.prop.unc, # all
      clo_vs_op,              # four
      num.prop_vs_uncl,       # contrast
      num_vs_prop))           # vectors
```

This sure *looks* easier (at first), but that 'backfires' -- let me show you how. Here are the summary results of the model defined with my contrasts, with rule 5 in place:

```{r}
summary(m_03_orthog)$coef %>% round(4)
```

And here are the summary results if you go with the 'easier' integer-based matrix:

```{r}
summary(m_03_orthog_notrule5 <- lm(DURATION_l ~ 1 + CLASS_orth_notrule5, data=d))$coef %>% round(4)
```

The reason for my preference becomes clear when you realize how all coefficients (except the intercept, which is the mean of the predicted means) differ between the two models. Here are the predictions of both models again:

```{r}
nd_orthog
```

We have seen above that, when you use my contrasts with rule 5 in place, the coefficients in the summary table correspond to exactly the differences between the (means of grouped) means of contrasted levels. But that's not the case anymore if you use the integer-based coding. Now,

* the coefficient `CLASS_orth_notrule5clo.op_vs_num.prop.unc` is `r coef(m_03_orthog_notrule5)[2]`, which is exactly ^1^/~5~ of the real difference, which we know is `r coef(m_03_orthog)[2]`;
* the coefficient `CLASS_orth_notrule5clo_vs_op` is `r coef(m_03_orthog_notrule5)[3]`, which is exactly ^1^/~2~ of the real difference, which we know is `r coef(m_03_orthog)[3]`;
* the coefficient `CLASS_orth_notrule5num.prop_vs_uncl` is `r coef(m_03_orthog_notrule5)[4]`, which is exactly ^1^/~3~ of the real difference, which we know is `r coef(m_03_orthog)[4]`;
* the coefficient `CLASS_orth_notrule5num_vs_prop` is `r coef(m_03_orthog_notrule5)[5]`, which is exactly ^1^/~2~ of the real difference, which we know is `r coef(m_03_orthog)[5]`.

In other words, using the integers as we did shrinks the coefficients by a factor of *f*, where *f* is the sum of the absolute values of the value types of a contrast vector.
Thus,

* if you define `clo.op_vs_num.prop.unc` as `c(3, -2, 3, -2, -2)`, then the sum of the absolute values of the types is 5, and we saw that the coefficient for that contrast in the model not following rule 5 is indeed reduced to just ^1^/~5~th of the real difference, but
* if you define `clo.op_vs_num.prop.unc` as `c(0.6, -0.4, 0.6, -0.4, -0.4)`, then, as per rule 5, the sum of the absolute values of the types is 1, so the coefficient for that contrast in the model following rule 5 is indeed ^1^/~1~th of, i.e. exactly, the real difference.

Thus, yes, of course you can save mental effort and use the integers, but you then need to invest more mental effort, I think, to reconstruct the real differences from the summary table -- better to use rule 5 right away.

## Exercises

### Exercise 1

A corpus-based study of four varieties of English -- a predictor called `VARIETY` with the 4 levels *NigE*, *BrE*, *IndE*, and *AmE* (in that order!) -- set out to test the hypotheses that

* NigE is significantly different from all others combined;
* IndE is significantly different from the two inner-circle varieties AmE and BrE combined;
* the two inner-circle varieties AmE and BrE are not significantly different.

Formulate the required contrast matrix for this predictor `VARIETY`.

```{r}
#| eval: false
#                NigE   BrE  IndE   AmE
nig_vs_rest <- c( 3/4, -1/4, -1/4, -1/4) # contrast 1
inn_vs_out  <- c( 0  ,  1/3, -2/3,  1/3) # contrast 2
bre_vs_ame  <- c( 0  ,  1/2,  0  , -1/2) # contrast 3
(contrasts(d$VARIETY) <- cbind(
   nig_vs_rest, inn_vs_out, bre_vs_ame))
```

### Exercise 2

A corpus-based study of 5 'varieties' of English -- a predictor called `L1` with the 5 levels *EN* (i.e. native speakers), *FR*, *GE*, *NO*, and *SPA* (in that order!) -- set out to test the hypotheses that

* the native speakers are significantly different from all others combined;
* the two Germanic L1s *NO* and *GE* combined are significantly different from the two Romance L1s *FR* and *SPA* combined;
* the two Germanic L1s *NO* and *GE* are not significantly different from each other;
* the two Romance L1s *FR* and *SPA* are not significantly different from each other.

Formulate the required contrast matrix for this predictor `L1`.

```{r}
#| eval: false
contrasts(d$L1) <- cbind(
   #                      EN    FR    GE    NO    SPA
   native_vs_nonnative =c( 0.8, -0.2, -0.2, -0.2, -0.2), # contrast 1
   romance_vs_germanic =c( 0  ,  0.5, -0.5, -0.5,  0.5), # contrast 2
   spanish_vs_french   =c( 0  , -0.5,  0  ,  0  ,  0.5), # contrast 3
   norwegian_vs_german =c( 0  ,  0  , -0.5,  0.5,  0  )) # contrast 4
```

# Session info

```{r}
sessionInfo()
```