In nearly all models involving numeric predictors so far, we have implicitly adopted the null hypothesis that the relationship between a numeric predictor and a response variable – numeric or categorical – is best, or at least most parsimoniously, captured with a straight line. That would be most parsimonious because it involves the smallest number of parameters: an intercept and a slope, rather than additional parameters that determine how a line would deviate from a straight line. But for many phenomena a straight line is most likely not the right approach (to put it very mildly). Learning, forgetting, priming, language change, etc. all involve non-linearities that we ideally would try to capture. This session will deal with a few approaches to this:
regression with breakpoints;
regression involving polynomials;
more flexible approaches such as splines and generalized additive models.
1.1 Regression w/ breakpoints
1.1.1 A simple linear model
Let's take a small data set and fit a simple linear model on it:
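The packages used below and the data frame d (two numeric variables PR and RE, 40 observations each, with the values used throughout this section) are set up as follows; the summary of d is shown right after:
pkgs2load <- c("car", "effects", "magrittr", "multcomp", "MuMIn")  # packages used below
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
summary(d <- data.frame(
   PR=c(0.98, 0.98, 1.52, 2.14, 2.61, 3.05, 3.65, 4.52, 5.09, 5.71, 5.51, 5.52, 4.92, 4.38,
        3.77, 3.39, 2.39, 2.39, 1.86, 1.41, 0.94, 1.92, 3.32, 4.16, 5.01, 5.56, 6.28, 6.66,
        7.26, 8.22, 8.71, 9.31, 8.79, 8.39, 8.39, 7.77, 7.02, 6.78, 6.32, 6.32),
   RE=c(5.48, 5.48, 4.71, 4.46, 4.11, 4.03, 3.64, 2.55, 1.88, 2.13, 2.27, 2.27, 3.08, 3.08,
        3.54, 3.78, 4.88, 4.88, 5.31, 5.83, 5.58, 5.23, 4.61, 3.25, 3.11, 3.68, 4.74, 5.66,
        6.51, 8.21, 9.01, 9.43, 8.83, 8.16, 8.16, 7.07, 5.41, 5.41, 4.63, 4.63)))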
PR RE
Min. :0.940 Min. :1.880
1st Qu.:2.555 1st Qu.:3.615
Median :4.965 Median :4.725
Mean :4.823 Mean :4.968
3rd Qu.:6.690 3rd Qu.:5.600
Max. :9.310 Max. :9.430
We first compute the deviance of the response and the simplest straight-line model:
deviance(m_00 <- lm(RE ~ 1, data=d))
[1] 152.3095
summary(m_straight <- lm(RE ~ 1 + PR, data=d))
Call:
lm(formula = RE ~ 1 + PR, data = d)
Residuals:
Min 1Q Median 3Q Max
-3.1881 -1.1411 -0.0121 1.5548 2.7758
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1550 0.6128 5.149 8.33e-06 ***
PR 0.3758 0.1132 3.322 0.00199 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.762 on 38 degrees of freedom
Multiple R-squared: 0.225, Adjusted R-squared: 0.2046
F-statistic: 11.03 on 1 and 38 DF, p-value: 0.001986
This doesn’t look all bad:
the model/predictor is very significant;
R2 is not all bad;
the residual summary is looking very good with a median very close to 0 and an at least somewhat symmetric distribution.
But the model diagnostics are terrible, as we see in Figure 1:
par(mfrow=c(1,2))             # show the
plot(m_straight, which=1:2)   # diagnostic
par(mfrow=c(1,1))             # plots
Figure 1: Model-diagnostic plots for m_straight
And if we compute the values that this model predicts and then plot them together with the observed values into Figure 2, we can see why the diagnostics say the model is terrible.
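One way to do that is to set up a prediction grid nd covering the range of PR, add the fit and its 95%-CI bounds, and plot everything:
nd <- data.frame(PR=seq(0, 10, 0.05))   # a prediction grid covering PR
nd <- cbind(nd, predict(m_straight, newdata=nd, interval="confidence"))
colnames(nd)[2:4] <- c("fit.str", "lwr.str", "upr.str")
plot(d$PR, d$RE, pch=16); grid()        # the observed values
lines(x=nd$PR, y=nd$fit.str, lty=2)     # the straight-line fit
polygon(col="#00000020", border=NA,     # & its 95%-CI band
   c(nd$PR, rev(nd$PR)), c(nd$lwr.str, rev(nd$upr.str)))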
Figure 2: Observations against predictions for m_straight
Clearly, we need something better, which is what we turn to now.
1.1.2 A better approach
Given Figure 1 and Figure 2, it seems as if there are two trends:
one when d$PR is between roughly 1 and 5;
one when d$PR is between roughly 5 and 9.5.
How would we approach this in a somewhat convoluted way? We might do something like this:
we split d up into two parts depending on whether d$PR is greater than 5 or not;
we compute the deviance of RE in each part of d;
we fit a separate lm regression line for RE ~ PR for each part of d;
we can compute some version of an R2 based on the summed deviance of the two separate models.
Here are steps 1 and 2:
lapply(d.split <- split(d, d$PR>5), summary)
$`FALSE`
PR RE
Min. :0.940 Min. :2.550
1st Qu.:1.775 1st Qu.:3.615
Median :2.500 Median :4.535
Mean :2.715 Mean :4.375
3rd Qu.:3.680 3rd Qu.:5.250
Max. :4.920 Max. :5.830
$`TRUE`
PR RE
Min. :5.010 Min. :1.880
1st Qu.:5.673 1st Qu.:3.538
Median :6.720 Median :5.410
Mean :6.931 Mean :5.560
3rd Qu.:8.262 3rd Qu.:8.160
Max. :9.310 Max. :9.430
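Step 2, the deviance of RE in each part – i.e., of an intercept-only model per part – summed:
"+"(deviance(lm(RE ~ 1, data=d.split[["FALSE"]])),  #  18.0321
    deviance(lm(RE ~ 1, data=d.split[["TRUE"]])))   # 120.247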
Note how the combined deviance of the two models (138.2791) is already lower than the initial null deviance (152.3095) because we’re now using two means for prediction instead of just one for the first null model on all the data. Now we do step 3:
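The two separate fits (step 3), their summed deviance (step 4), the resulting share of the null deviance they account for, and confidence intervals for the two slopes (via car::Confint) can be obtained like this:
summary(m_PRlt5 <- lm(RE ~ 1 + PR, data=d.split[["FALSE"]]))  # the part w/ PR <= 5
summary(m_PRgt5 <- lm(RE ~ 1 + PR, data=d.split[["TRUE"]]))   # the part w/ PR  > 5
(sum_of_devs <- deviance(m_PRlt5) + deviance(m_PRgt5))         # step 4: summed deviance
(152.3095 - sum_of_devs) / 152.3095   # share of the null deviance accounted for
Confint(m_PRlt5)                      # CI of the slope of PR when PR <= 5
Confint(m_PRgt5)                      # CI of the slope of PR when PR  > 5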
Clearly, the slopes are both significantly different from each other and from 0.
The better way, the one that also gets us actual p-values, would of course involve multcomp::glht. The following checks whether each slope is significantly different from the other:
summary(glht(                 # show the summary of a glht, namely
   m_PRlt5,                   # check whether the sum of (these coeffs of m_PRlt5
   matrix(c(0,1), nrow=1),    # multiplied by this)
   rhs=1.82))                 # is significantly different from 1.82
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = RE ~ 1 + PR, data = d.split[["FALSE"]])
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
1 == 1.82 -0.72961 0.06072 -41.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
summary(glht(                 # show the summary of a glht, namely
   m_PRgt5,                   # check whether the sum of (these coeffs of m_PRgt5
   matrix(c(0,1), nrow=1),    # multiplied by this)
   rhs=-0.73))                # is significantly different from -0.73
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = RE ~ 1 + PR, data = d.split[["TRUE"]])
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
1 == -0.73 1.81954 0.09129 27.93 3.33e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
… and this now checks whether the slopes are significantly different from 0:
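With the default rhs of 0, we can extract each slope's confidence interval and p-value, for example like this:
glht(m_PRlt5, matrix(c(0, 1), nrow=1)) %>% { list(
   "CI"=confint(.)$confint,            # the slope's 95%-CI ...
   "P" =summary(.)$test$pvalues[1]) }  # ... & its p-value
glht(m_PRgt5, matrix(c(0, 1), nrow=1)) %>% { list(
   "CI"=confint(.)$confint,
   "P" =summary(.)$test$pvalues[1]) }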
But this is not how we should do this for two reasons. The first of these is that we seem to have a situation where we can say ‘the effect of PR is not the same everywhere’, which means we should not go with two separate models, but one, namely one that involves an interaction. Interestingly, the interaction here is kind of like a numeric variable interacting with a binary version of itself!
1.1.2.1 The interaction
Thus, we start by defining a variable PR_bin and add it to the data frame d:
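With the heuristic breakpoint of 5, that can be done like this (the summary of the augmented data frame follows):
breakpoint <- 5                               # the heuristic breakpoint from the plots
d$PR_bin <- factor(                           # binary version of PR: greater vs.
   ifelse(d$PR>breakpoint, "gtbp", "ltbp"),   # less than the breakpoint
   levels=c("ltbp", "gtbp"))
summary(d <- d[,c(1,3,2)])                    # reorder the columns & summarize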
PR PR_bin RE
Min. :0.940 ltbp:20 Min. :1.880
1st Qu.:2.555 gtbp:20 1st Qu.:3.615
Median :4.965 Median :4.725
Mean :4.823 Mean :4.968
3rd Qu.:6.690 3rd Qu.:5.600
Max. :9.310 Max. :9.430
Then we fit a new model that includes the 'old' numeric predictor but now also its binary version, as well as the interaction of the two, and plot its diagnostics:
summary(m_breakp <- lm(            # fit and
   RE ~ 1 +                        # summarize the
   PR + PR_bin + PR:PR_bin,        # model w/ the
   data=d))                        # new predictors
Call:
lm(formula = RE ~ 1 + PR + PR_bin + PR:PR_bin, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.2084 -0.1990 -0.0547 0.2820 1.0453
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.35638 0.24298 26.160 < 2e-16 ***
PR -0.72961 0.08155 -8.947 1.11e-10 ***
PR_bingtbp -13.40758 0.58822 -22.794 < 2e-16 ***
PR:PR_bingtbp 2.54914 0.11143 22.878 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4476 on 36 degrees of freedom
Multiple R-squared: 0.9527, Adjusted R-squared: 0.9487
F-statistic: 241.5 on 3 and 36 DF, p-value: < 2.2e-16
deviance(m_breakp) # same as sum_of_devs
[1] 7.211284
par(mfrow=c(1,2))             # show the
plot(m_breakp, which=1:2)     # diagnostic
par(mfrow=c(1,1))             # plots
Figure 3: Model-diagnostic plots for m_breakp w/ a breakpoint of 5
See how well this model does? (Look at both the range of x-axis values of the first diagnostic plot and the relative lack of structure in it!) It has the same deviance as what we computed before for sum_of_devs from the deviances of m_PRlt5 and m_PRgt5, a very high multiple R2, and the interaction is significant, which shows that
the slope of PR in m_PRlt5 (-0.7296054) and
the slope of PR in m_PRgt5 (1.8195353)
are indeed significantly different. Plus, obviously, the diagnostics of this single model now look much better than they did for m_straight, which becomes clearest if we compare the fitted-vs.-residuals plots of both models in one panel, as in Figure 4:
Figure 4: Model-diagnostic plots for m_straight vs. m_breakp
How might we plot this best? The usually best strategy – go with effects – isn’t too great here:
plot(effects::allEffects(m_breakp, grid=TRUE))
We want something better. For that, we go back to our data frame nd that so far only contains a sequence of values covering the predictor PR and the predictions of m_straight and
add a new column that states for each value in that sequence which value of PR_bin it would come with;
add new columns with the predictions of our model m_breakp for these pairs of values and their lower and upper 95% CI:
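Both steps, followed by a look at a few rows of nd:
# step 1: state which value of PR_bin each PR value in nd comes with
nd$PR_bin <- c("ltbp", "gtbp")[as.numeric(nd$PR>=breakpoint)+1]
# step 2: add the predictions of m_breakp & their 95%-CI bounds
nd <- cbind(nd, predict(m_breakp, newdata=nd, interval="confidence"))
colnames(nd)[6:8] <- c("fit.brk", "lwr.brk", "upr.brk")
nd[c(1:2, 99:102, 199:200),]
Plotting the observed values, these predictions, and their CI bands (separately for the two levels of PR_bin) then yields Figure 5, which looks much better than the effects plot.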
Figure 5: Observations against predictions for m_breakp w/ a breakpoint of 5
1.1.2.2 The breakpoint
But this addressed ‘only’ the first problem – the second is, how do we even know the best breakpoint is indeed 5? We just went with a visual heuristic but didn’t really explore this properly, which is what we will do now. For this, we could do the following:
we determine all possible breakpoints at which d$PR could be split into two values;
for each possible breakpoint, we fit a version of m_breakp and collect its deviance;
the best breakpoint is the one whose model has the smallest deviance; that model is the 'final model' we would then adopt.
Here’s step 1: We determine all unique values of d$PR and the possible breakpoints between them:
unique_PR_values <- d$PR %>% unique %>% sort  # sort unique values
possible_breakpoints <- setNames(             # make possible_breakpoints
   # the middles between each value & the next
   (unique_PR_values[-1] + unique_PR_values[-length(unique_PR_values)]) / 2,
   # & assign as names the lower and the upper values of each pair:
   paste(unique_PR_values[1:35], unique_PR_values[2:36], sep="-"))
possible_breakpoints
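We also need a collector vector for the deviances, with one (named) slot per possible breakpoint:
possible_deviances <- setNames(rep(NA, length(possible_breakpoints)), possible_breakpoints)
Then step 2: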
for (i in seq(possible_breakpoints)) {  # loop over the breakpoints
   d$PR_bin <- factor(ifelse(           # create a binary predictor at breakpoint
      d$PR>possible_breakpoints[i], "gtbp", "ltbp"))  # w/ these levels
   # collect the deviance when that predictor is used in the model:
   possible_deviances[i] <- lm(RE ~ 1 + PR*PR_bin, data=d) %>% deviance
}
Step 3: what’s the best breakpoint? The plot shows that our guesstimate of 5 really wasn’t all that bad, yet still not optimal:
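Finding the breakpoint with the smallest deviance and marking it (and our earlier guess of 5) in a plot of deviances against breakpoints can be done like this:
bestbreakpoint <- possible_breakpoints[which(possible_deviances==min(possible_deviances))]
plot(type="l",
   xlim=c(0, 10),  x=possible_breakpoints,
   ylim=c(0, 150), y=possible_deviances); grid()
arrows(bestbreakpoint, 100, bestbreakpoint, 10)  # mark the best breakpoint ...
text(bestbreakpoint, 100, bestbreakpoint, pos=3)
arrows(5, 50, 5, 10)                             # ... & our heuristic guess of 5
text(5, 50, 5, pos=3)
Figure 6: Deviances against breakpoint selections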
So now we fit our model again, but now with this best breakpoint of 5.995 and we plot diagnostics of it again, too:
d$PR_bin <- factor(ifelse(d$PR>bestbreakpoint, ">bp", "<bp"))
summary(m_bestbreakp <- lm(        # fit and
   RE ~ 1 +                        # summarize the
   PR + PR_bin + PR:PR_bin,        # model w/ the
   data=d))                        # new predictors
Call:
lm(formula = RE ~ 1 + PR + PR_bin + PR:PR_bin, data = d)
Residuals:
Min 1Q Median 3Q Max
-0.85011 -0.14936 -0.09176 0.26467 1.27806
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.28409 0.17690 35.52 <2e-16 ***
PR -0.69823 0.04802 -14.54 <2e-16 ***
PR_bin>bp -12.11922 0.79722 -15.20 <2e-16 ***
PR:PR_bin>bp 2.36978 0.11234 21.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3835 on 36 degrees of freedom
Multiple R-squared: 0.9652, Adjusted R-squared: 0.9623
F-statistic: 333.2 on 3 and 36 DF, p-value: < 2.2e-16
deviance(m_bestbreakp)
[1] 5.294669
par(mfrow=c(1,2))               # show the
plot(m_bestbreakp, which=1:2)   # diagnostic
par(mfrow=c(1,1))               # plots
Figure 7: Model-diagnostic plots for m_breakp w/ the best breakpoint of 6
The R2-value of this model with the best breakpoint is a bit higher than the one with our initial heuristic breakpoint, and of course this model is significantly better than the straight-line model:
anova(m_straight, m_breakp , test="F")
Analysis of Variance Table
Model 1: RE ~ 1 + PR
Model 2: RE ~ 1 + PR + PR_bin + PR:PR_bin
Res.Df RSS Df Sum of Sq F Pr(>F)
1 38 118.038
2 36 7.211 2 110.83 276.63 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m_straight, m_bestbreakp, test="F")
Analysis of Variance Table
Model 1: RE ~ 1 + PR
Model 2: RE ~ 1 + PR + PR_bin + PR:PR_bin
Res.Df RSS Df Sum of Sq F Pr(>F)
1 38 118.038
2 36 5.295 2 112.74 383.29 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c("AIC of m_straight"=AIC(m_straight), "AIC of m_breakp"=AIC(m_breakp), "AIC of m_bestbreakp"=AIC(m_bestbreakp))
AIC of m_straight AIC of m_breakp AIC of m_bestbreakp
162.80012 54.98579 42.62793
So let’s generate a new plot, for which we first compute the new predictions …
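A sketch of those two steps, reusing the prediction grid nd and overwriting the columns that so far held the predictions of m_breakp:
# step 1 (again): which side of the best breakpoint does each PR value fall on?
nd$PR_bin <- c("<bp", ">bp")[as.numeric(nd$PR>=bestbreakpoint)+1]
# step 2 (again): overwrite the old predictions with those of m_bestbreakp
nd[,c("fit.brk", "lwr.brk", "upr.brk")] <- predict(m_bestbreakp, newdata=nd, interval="confidence")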
Then we plot this in the same way as before. (Since we will add more info to this plot multiple times later, I put the plotting code into a function to save ‘code space’ below.)
Figure 8: Observations against predictions (with the best breakpoint)
Not a huge difference to the breakpoint model where we just assumed the breakpoint was 5 – probably none when it comes to interpretation, in fact – but this is 'objectively' the best model (with a breakpoint) we could get for these data.
1.2 Polynomials
1.2.1 A first example
One possible alternative would be to not posit a clear breakpoint that separates two straight lines but to assume one line with some amount of curvature. Perhaps the simplest way to do this is a polynomial regression approach, in which the numeric predictor is entered on the right-hand side as a polynomial of the n-th degree, with n-1 being, in a sense, the number of different curves/turns.
The straight-line model we fit above is this:
summary(m_straight)
Call:
lm(formula = RE ~ 1 + PR, data = d)
Residuals:
Min 1Q Median 3Q Max
-3.1881 -1.1411 -0.0121 1.5548 2.7758
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1550 0.6128 5.149 8.33e-06 ***
PR 0.3758 0.1132 3.322 0.00199 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.762 on 38 degrees of freedom
Multiple R-squared: 0.225, Adjusted R-squared: 0.2046
F-statistic: 11.03 on 1 and 38 DF, p-value: 0.001986
deviance(m_straight)
[1] 118.0379
A model with a polynomial allowing for one curve could be written like this:
summary(m_polyorth <- lm(
   RE ~ 1 + poly(PR, 2),   # <---
   data=d))
Call:
lm(formula = RE ~ 1 + poly(PR, 2), data = d)
Residuals:
Min 1Q Median 3Q Max
-1.6437 -0.3065 0.1777 0.4188 1.1970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.9678 0.1106 44.92 < 2e-16 ***
poly(PR, 2)1 5.8542 0.6994 8.37 4.62e-10 ***
poly(PR, 2)2 9.9969 0.6994 14.29 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6994 on 37 degrees of freedom
Multiple R-squared: 0.8812, Adjusted R-squared: 0.8747
F-statistic: 137.2 on 2 and 37 DF, p-value: < 2.2e-16
deviance(m_polyorth)
[1] 18.09907
Certainly a massive improvement over the straight-line model! That means the curvature pretty much has to be significant, and it is; note: the F-value of the anova comparison is the square of the t-value for poly(PR, 2)2 in the summary output.
anova(m_straight, m_polyorth, test="F")
Analysis of Variance Table
Model 1: RE ~ 1 + PR
Model 2: RE ~ 1 + poly(PR, 2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 38 118.038
2 37 18.099 1 99.939 204.31 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c("AIC of m_straight"=AIC(m_straight), "AIC of m_polyorth"=AIC(m_polyorth))
AIC of m_straight AIC of m_polyorth
162.80012 89.79432
What do the diagnostics say? They are not as good as those of m_bestbreakp, but certainly much better than the straight-line version m_straight:
par(mfrow=c(1,2))             # show the
plot(m_polyorth, which=1:2)   # diagnostic
par(mfrow=c(1,1))             # plots
Figure 9: Model-diagnostic plots for m_polyorth
What does the fit look like? Figure 10 indicates it’s quite good though possibly not as good as the model with a breakpoint:
Figure 10: Observations against predictions (m_straight & m_poly)
How could we check whether the model with the best breakpoint m_bestbreakp is indeed better than the polynomial m_polyorth? Because the two models are not nested, we can't use anova, but we can compare their AIC-values:
c("AIC of m_bestbreakp"=AIC(m_bestbreakp), "AIC of m_polyorth"=AIC(m_polyorth))
AIC of m_bestbreakp AIC of m_polyorth
42.62793 89.79432
AIC shows that the model with the best breakpoint is much better than the polynomial!
But (i) how do we interpret the coefficients in the summary output of m_polyorth? And (ii) how does the polynomial do what it does? The short answer to (i) is: you don't. The kind of polynomial coefficients we created here is pretty much uninterpretable, and interpretation of the results is based on the visualization of the predictions, as we have done before. This is because of the answer to (ii). Here are the model matrices for both models; the one for m_straight is fine, but the one for m_polyorth is not intuitive at all:
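A quick look (head) at both model matrices:
model.matrix(m_straight) %>% head  # intercept & PR
model.matrix(m_polyorth) %>% head  # intercept & two orthogonal polynomial columns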
One reason why it’s unintuitive is that the default polynomials in R are orthogonal polynomials, with orthogonal meaning they are transformed such that they end up not being linearly correlated with each other (which should remind you of collinearity):
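We can check this directly – the correlation between the two polynomial columns is (essentially) zero:
cor(model.matrix(m_polyorth)[,2],     # the correlation between the two
    model.matrix(m_polyorth)[,3]) %>% # orthogonal polynomial columns ...
    round(4)                          # ... is (essentially) 0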
But to appreciate the logic of polynomials a bit more, we can look at a non-orthogonalized version of these polynomials, which is a bit easier to comprehend; we define polynomials again but now use the additional argument raw=TRUE:
summary(m_polyraw <- lm(
   RE ~ 1 + poly(PR, 2, raw=TRUE),   # <---
   data=d))
Call:
lm(formula = RE ~ 1 + poly(PR, 2, raw = TRUE), data = d)
Residuals:
Min 1Q Median 3Q Max
-1.6437 -0.3065 0.1777 0.4188 1.1970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.23176 0.43044 19.12 < 2e-16 ***
poly(PR, 2, raw = TRUE)1 -2.38307 0.19817 -12.03 2.4e-14 ***
poly(PR, 2, raw = TRUE)2 0.28062 0.01963 14.29 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6994 on 37 degrees of freedom
Multiple R-squared: 0.8812, Adjusted R-squared: 0.8747
F-statistic: 137.2 on 2 and 37 DF, p-value: < 2.2e-16
deviance(m_polyraw)
[1] 18.09907
You can see that everything stays the same – R2, F, residuals, … – so in that sense the model m_polyraw is equivalent to the model m_polyorth, but now check out the model matrix:
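The raw 2nd-degree polynomial simply uses one column for PR itself and a second column for its square (the squared term is what curves the line); those two columns are, of course, extremely highly correlated:
model.matrix(m_polyraw) %>% head      # column 2: PR, column 3: PR^2
cor(model.matrix(m_polyraw)[,2],      # the correlation between the two
    model.matrix(m_polyraw)[,3]) %>%  # raw polynomial columns ...
    round(4)                          # ... is very high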
Because of that, the default way in which poly proceeds is to orthogonalize these two columns, which, as we have seen, eliminates the high correlation between the two columns in the model matrix, but is also what makes the coefficients uninterpretable.
One natural follow-up question now is when to use regression with breakpoints and when a polynomial approach. Sometimes, you might be able to decide on the basis of fit and we have seen that the fit of the model with a breakpoint is much better than the polynomial model. However, sometimes you might also decide on the basis of the conceptual logic and hypotheses that underpin the analysis (maybe a breakpoint’s location follows perfectly well from previous work or it is easier, or harder, to motivate a single curved causal mechanism) – in other words, ‘it depends’.
1.2.2 A more fine-grained example
Let’s look at this with a simpler example, but in more detail. We take the reaction time data from the book, where the expectations were that
FAM would be negatively correlated with RT: the more familiar, the faster one reacts to the word;
FREQ would be negatively correlated with RT: the more frequent, the faster one reacts to the word.
We first log FREQ, …
summary(x <- read.delim(              # summarize x, the result of loading
   file="_input/reactiontimes.csv",   # this file
   stringsAsFactors=TRUE))            # change categorical variables into factors
CASE RT FREQ FAM IMA
almond : 1 Min. :523.0 Min. : 1.00 hi :12 hi :30
ant : 1 1st Qu.:589.4 1st Qu.: 2.00 lo :12 lo :23
apple : 1 Median :617.7 Median : 4.00 med :31 NA's:24
apricot : 1 Mean :631.9 Mean : 9.26 NA's:22
asparagus: 1 3rd Qu.:663.6 3rd Qu.: 10.00
avocado : 1 Max. :794.5 Max. :118.00
(Other) :71
MEAN
Min. :315.0
1st Qu.:409.8
Median :437.5
Mean :436.0
3rd Qu.:466.2
Max. :553.0
NA's :29
x <- droplevels(x[complete.cases(x[,c("RT", "FREQ", "FAM")]),])
x$FREQ <- log2(x$FREQ)
… and then apply a backwards model selection process to the initial model RT ~ poly(FREQ, 2) * FAM to identify a final model. Here’s the initial model:
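Here it is, together with drop1 and the two candidate simplifications whose comparisons are shown below: m_02a removes the interaction of the polynomial with FAM, and m_02b removes the curvature but keeps FAM's interaction with FREQ as a straight line; we also compare their AICc-values:
summary(m_01 <- lm(RT ~ 1 + poly(FREQ, 2) * FAM, data=x))  # the initial model
drop1(m_01, test="F") %>% data.frame                       # which term could be dropped?
summary(m_02a <- update(m_01, .~.           # remove the interaction of the
   - poly(FREQ, 2):FAM))                    # polynomial w/ FAM
summary(m_02b <- update(m_01, .~.           # remove the curvature, but keep
   - poly(FREQ, 2)*FAM + FREQ*FAM))         # FAM's interaction w/ FREQ as a straight line
c("m_01"=AICc(m_01), "m_02a"=AICc(m_02a), "m_02b"=AICc(m_02b))  # compare AICc-values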
anova(m_01, m_02a, test="F") %>% data.frame # rep for convenience
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 46 87814.86 NA NA NA NA
2 50 97649.38 -4 -9834.521 1.287903 0.2886233
anova(m_01, m_02b, test="F") %>% data.frame
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 46 87814.86 NA NA NA NA
2 49 97527.67 -3 -9712.808 1.695951 0.1810154
Given these results, we go with m_02a because it is ‘less significantly different’ from m_01 and has a lower AICc-value than m_01 and m_02b. We continue:
drop1(m_02a, test="F") %>% data.frame
Df Sum.of.Sq RSS AIC F.value Pr..F.
<none> NA NA 97649.38 421.4993 NA NA
poly(FREQ, 2) 2 11644.65 109294.02 423.6955 2.981239 0.059817286
FAM 2 19977.07 117626.45 427.7365 5.114489 0.009530436
Given this result of drop1, you are likely tempted to do this next:
summary(m_03 <- update(m_02a, .~. - poly(FREQ, 2)))  # remove what drop1 said is ns
Call:
lm(formula = RT ~ FAM, data = x)
Residuals:
Min 1Q Median 3Q Max
-84.370 -24.814 -3.342 22.482 131.164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 591.39 13.23 44.685 < 2e-16 ***
FAMlo 71.90 18.72 3.842 0.000334 ***
FAMmed 22.26 15.59 1.428 0.159250
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45.85 on 52 degrees of freedom
Multiple R-squared: 0.2349, Adjusted R-squared: 0.2055
F-statistic: 7.982 on 2 and 52 DF, p-value: 0.0009481
anova(m_02a, m_03, test="F") %>% data.frame # same as drop1
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 50 97649.38 NA NA NA NA
2 52 109294.02 -2 -11644.65 2.981239 0.05981729
But note that the ns term returned by drop1 contains 2 ‘elements’, so to speak:
we dropped a linear component of frequency (column 2 of mm_02a), and
we dropped a squared component of frequency (column 3 of mm_02a),
which means we lost 2 (!) columns in the model matrix:
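So let's test each component separately: m_03a keeps FREQ as a straight line, m_03b keeps only its squared term; we then compare both against m_02a (the first of those comparisons is the last line below, and its output follows):
head(mm_03 <- model.matrix(m_03))             # only intercept & FAM dummies left
summary(m_03a <- update(m_02a, .~.
   - poly(FREQ, 2)                            # remove what drop1 said is ns ...
   + FREQ))                                   # ... but keep FREQ as a straight line
summary(m_03b <- update(m_02a, .~.
   - poly(FREQ, 2)                            # remove what drop1 said is ns ...
   + I(FREQ^2)))                              # ... but keep FREQ squared
c("m_02a"=AICc(m_02a), "m_03a"=AICc(m_03a), "m_03b"=AICc(m_03b))  # compare AICc-values
anova(m_02a, m_03a, test="F") %>% data.frame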
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 50 97649.38 NA NA NA NA
2 51 101358.58 -1 -3709.203 1.899245 0.1742991
anova(m_02a, m_03b, test="F") %>% data.frame
Res.Df RSS Df Sum.of.Sq F Pr..F.
1 50 97649.38 NA NA NA NA
2 51 105804.64 -1 -8155.26 4.175787 0.04629252
We go with m_03a because it is not significantly different from m_02a and has a lower AICc-value than the other models:
drop1(m_03a, test="F") %>% data.frame
Df Sum.of.Sq RSS AIC F.value Pr..F.
<none> NA NA 101358.6 421.5498 NA NA
FAM 2 18303.144 119661.7 426.6800 4.604743 0.01450722
FREQ 1 7935.443 109294.0 423.6955 3.992830 0.05103968
And now we’re done because …? We said we expected that “FREQ would be negatively correlated with RT”, which it is, which means we can use a one-tailed p-value: instead of 0.0510397 we get 0.0255198, which is significant.
summary(m_03a)
Call:
lm(formula = RT ~ FAM + FREQ, data = x)
Residuals:
Min 1Q Median 3Q Max
-76.072 -28.680 -1.306 24.454 119.161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 621.433 19.792 31.399 <2e-16 ***
FAMlo 53.858 20.317 2.651 0.0107 *
FAMmed 10.531 16.254 0.648 0.5199
FREQ -7.854 3.930 -1.998 0.0510 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.58 on 51 degrees of freedom
Multiple R-squared: 0.2904, Adjusted R-squared: 0.2487
F-statistic: 6.959 on 3 and 51 DF, p-value: 0.0005159
Bottom line: You may not always have to go to the lengths shown here, but be aware of what you're dropping/doing! Look, this is how complex it can get:
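For instance, crossing a raw 2nd-degree polynomial of MEAN with a raw 2nd-degree polynomial of FREQ already produces nine model-matrix columns:
summary(model_eg <- lm(RT ~ 1 +
   poly(MEAN, 2, raw=TRUE) * poly(FREQ, 2, raw=TRUE), data=x))$coefficients
head(model.matrix(model_eg))
The nine columns are: the intercept, MEAN, MEAN^2, FREQ, FREQ^2, MEAN*FREQ, MEAN^2*FREQ, MEAN*FREQ^2, and MEAN^2*FREQ^2.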
1.3 Splines and/or GAMs
There are other, more sophisticated ways to model curvature. Polynomials are a good first shot and I nearly always try them first, but they are sometimes not sensitive enough. One alternative is what are called splines (usually "natural cubic splines"), and in the simplest case they can be fit fairly easily like this:
library(splines)
summary(m_spline_2 <- lm(
   RE ~ 1 + ns(PR, knots=2),
   data=d))
Call:
lm(formula = RE ~ 1 + ns(PR, knots = 2), data = d)
Residuals:
Min 1Q Median 3Q Max
-1.8326 -0.2414 0.1850 0.5269 1.4625
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5026 0.3264 19.921 < 2e-16 ***
ns(PR, knots = 2)1 -5.2920 0.7510 -7.046 2.46e-08 ***
ns(PR, knots = 2)2 5.2024 0.4106 12.670 5.01e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7686 on 37 degrees of freedom
Multiple R-squared: 0.8565, Adjusted R-squared: 0.8487
F-statistic: 110.4 on 2 and 37 DF, p-value: 2.524e-16
While the coefficients are again uninterpretable, this seems very similar to the quality of the polynomial model, but the polynomial is in fact a bit better, which is confirmed by the AIC-values as well:
AIC(m_spline_2)
[1] 97.34124
AIC(m_polyorth)
[1] 89.79432
Let’s compare the predictions of both; in Figure 11, we add the new result to all previous ones:
Figure 11: Observations against predictions (m_straight, m_bestbreakpoint, m_poly, m_splines)
For the sake of interpreting this, I am not sure the statistical difference here between the polynomial and the splines could be important conceptually, theoretically, or in any applied sense.
The final way to look at curvature is the most powerful but also by far the most complex one: it involves a method called generalized additive models (gams). On the surface, using this method is deceptively simple; using the package mgcv, the approach looks very much like the splines one:
library(mgcv)
summary(m_gam <- gam(
   RE ~ 1 + s(PR),
   data=d))
Variance explanation is excellent: R2 is greater than 0.95, but R2 of m_bestbreakp was a tiny bit better – AIC has a slight preference for m_gam:
AIC(m_breakp)
[1] 54.98579
AIC(m_gam)
[1] 53.65669
Also, we can see that the smooth term – the thing responsible for the curvature – is highly significant and involves a degree of curvature that is equivalent to about 6 knots (edf, for 'effective degrees of freedom', the squiggliness the gam estimated from the data). But as before, these numerical results don't tell us anything about interpretation, so we plot again, this time into Figure 12 (and without a function now because this will be the last plot):
Figure 12: Observations against predictions (all models)
The results of the gam clearly support the regression with breakpoints – although the big turning point of the gam is again closer to PR=5 … – and I leave it up to you to verify that diagnostics such as residuals-vs.-fitted look fine. You might therefore of course wonder why one wouldn't always use a gam – the answer is what I alluded to above: gams are indeed extremely powerful, and in the best of worlds we'd use them much more often, but the complexity that comes with them often makes them really hard to apply. And in this rarest of cases, the gam is not consistently better than the regression with the best breakpoint (and any interpretation of the gam's results would probably be identical to that of the model with the best breakpoint).
1.4 Exercises
1.4.1 Exercise 1
Take the duration data, log DURATION as before, square root SURPRISAL as before, …
… and then determine whether the relationship between DURATION.l and SURPRISAL.s is better captured with a straight line or a polynomial to the second degree.
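A sketch of one way to approach this (the file _input/wdurations.csv and the columns DURATION and SURPRISAL are assumed to be as in the earlier sessions this exercise refers to):
summary(d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE))
d$DURATION.l  <- log2(d$DURATION)    # log the durations (as before)
d$SURPRISAL.s <- sqrt(d$SURPRISAL)   # square-root the surprisal values (as before)
summary(m_straight <- lm(DURATION.l ~ 1 + SURPRISAL.s, data=d))
summary(m_polyorth <- lm(DURATION.l ~ 1 + poly(SURPRISAL.s, 2), data=d))
anova(m_straight, m_polyorth, test="F")  # is the curvature significant?
AIC(m_straight); AIC(m_polyorth)         # which model does AIC prefer?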
1.4.2 Exercise 2
Hilpert (2011) discusses the change in the frequency of the keep V-ing construction in 18 years of the TIME Magazine corpus; the data are in _input/keepVing.csv:
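Reading the data in (its summary follows):
summary(d <- read.delim("_input/keepVing.csv", stringsAsFactors=TRUE))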
YEAR CORPSIZE TOKENS TOKpmw
Min. :1990 Min. :20098563 Min. :1507 Min. :73.02
1st Qu.:1994 1st Qu.:20521914 1st Qu.:1786 1st Qu.:87.15
Median :1998 Median :20626230 Median :1834 Median :89.11
Mean :1998 Mean :20597131 Mean :1828 Mean :88.76
3rd Qu.:2003 3rd Qu.:20742853 3rd Qu.:1916 3rd Qu.:92.88
Max. :2007 Max. :20993988 Max. :1982 Max. :96.18
You can see what the variables/columns contain in the file _input/keepVing.r. Fit a nice regression to model the normalized frequency of the construction over time.
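A reasonable start is a scatter plot and a straight-line model (its summary follows):
plot(pch=16, col="#00000020",
   x=jitter(d$YEAR),
   y=       d$TOKpmw); grid()
summary(m_straight <- lm(TOKpmw ~ 1 + YEAR, data=d))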
Call:
lm(formula = TOKpmw ~ 1 + YEAR, data = d)
Residuals:
Min 1Q Median 3Q Max
-10.153 -2.800 1.169 2.872 7.050
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1400.4890 427.3645 -3.277 0.00474 **
YEAR 0.7452 0.2138 3.485 0.00306 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.707 on 16 degrees of freedom
Multiple R-squared: 0.4315, Adjusted R-squared: 0.3959
F-statistic: 12.14 on 1 and 16 DF, p-value: 0.003061
abline(m_straight, col="red")
# show the diagnostic plots:
par(mfrow=c(1,2))
plot(m_straight, which=1:2)
par(mfrow=c(1,1))# regression with breakpointspossible_breakpoints <-setNames( # make possible_breakpoints# the middles between each value & the next (d$YEAR[-1] + d$YEAR[-length(d$YEAR)]) /2,# & assign as names the lower and the upper values of each pair:paste(d$YEAR[1:17], d$YEAR[2:18], sep="-"))possible_deviances <-setNames(rep(NA, length(possible_breakpoints)), possible_breakpoints)for (i inseq(possible_breakpoints)) { # loop over the breakpoints d$YEAR_bin <-factor(ifelse( # create a binary predictor at breakpoint d$YEAR>possible_breakpoints[i], "gtbp", "ltbp")) # w/ these levels# collect the deviance when that predictor is used in the model: possible_deviances[i] <-lm(TOKpmw ~1+ YEAR*YEAR_bin, data=d) %>% deviance}sort(possible_deviances)
breakpoint <- 1996.5  # I am not picking the best one here -- why?
plot(type="l",
   xlim=c(1990, 2007), x=possible_breakpoints,
   ylim=c(0, 350), y=possible_deviances); grid()
arrows(breakpoint, 250, breakpoint, possible_deviances[as.character(breakpoint)])
text(breakpoint, 250, breakpoint, pos=3)
d$YEAR_bin <- factor(ifelse(d$YEAR>breakpoint, "gtbp", "ltbp"), levels=c("ltbp", "gtbp"))
summary(m_breakp <- lm(                # fit and
   TOKpmw ~ 1 +                        # summarize the
   YEAR + YEAR_bin + YEAR:YEAR_bin,    # model w/ the
   data=d))                            # new predictors
Call:
lm(formula = TOKpmw ~ 1 + YEAR + YEAR_bin + YEAR:YEAR_bin, data = d)
Residuals:
Min 1Q Median 3Q Max
-5.5289 -2.5491 0.6309 2.3026 5.0605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5895.2245 1300.8636 -4.532 0.000470 ***
YEAR 3.0004 0.6527 4.597 0.000415 ***
YEAR_bingtbp 5774.2274 1458.3893 3.959 0.001425 **
YEAR:YEAR_bingtbp -2.8943 0.7311 -3.959 0.001427 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.454 on 14 degrees of freedom
Multiple R-squared: 0.7322, Adjusted R-squared: 0.6748
F-statistic: 12.76 on 3 and 14 DF, p-value: 0.0002718
# very small loss of R-squared for the not quite optimal breakpoint
par(mfrow=c(1,2))           # show the
plot(m_breakp, which=1:2)   # diagnostic
par(mfrow=c(1,1))           # plots
# still not perfect (but this is 18 data points ...)
anova(m_straight, m_breakp, test="F")
Analysis of Variance Table
Model 1: TOKpmw ~ 1 + YEAR
Model 2: TOKpmw ~ 1 + YEAR + YEAR_bin + YEAR:YEAR_bin
Res.Df RSS Df Sum of Sq F Pr(>F)
1 16 354.49
2 14 167.01 2 187.48 7.8581 0.005152 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
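Before comparing further models, we can add the breakpoint model's predictions (with their CI bands) to a plot of the observed data – wrapped into a function plot1, which is reused below – and then fit a gam for comparison:
AIC(m_straight); AIC(m_breakp)  # significantly better than the straight line, of course
d <- cbind(d, predict(m_breakp, newdata=d, interval="confidence"))
colnames(d)[6:8] <- c("fit.brk", "lwr.brk", "upr.brk")
plot1 <- function() {
   # observed data
   plot(pch=16, col=ifelse(d$YEAR_bin=="ltbp", "#FF00FFFF", "#00FF00FF"),
      x=d$YEAR, y=d$TOKpmw); grid(); abline(v=breakpoint, lty=3)
   abline(m_straight, lty=2)  # the straight-line regression model
   # predicted values & their 95%-CI bands, separately per level of YEAR_bin
   lines(x=d$YEAR[d$YEAR_bin=="ltbp"], y=d$fit.brk[d$YEAR_bin=="ltbp"], col="#FF00FFFF")
   with(d[d$YEAR_bin=="ltbp",], polygon(border=NA, col="#FF00FF30",
      c(YEAR, rev(YEAR)), c(lwr.brk, rev(upr.brk))))
   lines(x=d$YEAR[d$YEAR_bin=="gtbp"], y=d$fit.brk[d$YEAR_bin=="gtbp"], col="#00FF00FF")
   with(d[d$YEAR_bin=="gtbp",], polygon(border=NA, col="#00FF0030",
      c(YEAR, rev(YEAR)), c(lwr.brk, rev(upr.brk))))
}; plot1()
# a generalized additive model
library(mgcv)
summary(m_gam <- gam(TOKpmw ~ 1 + s(YEAR), data=d))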
AIC(m_breakp); AIC(m_gam) # gam is better, but might be too fine-grained!
[1] 101.1797
[1] 98.8352
# note how I compute the confidence interval for the gam predictions:
qwe <- predict(m_gam, newdata=d, se.fit=TRUE)              # w/ std. errors ...
d <- cbind(d, "fit.gam"=qwe$fit,                           # ... from the
              "lwr.gam"=qwe$fit+qnorm(0.025)*qwe$se.fit,   # normal
              "upr.gam"=qwe$fit+qnorm(0.975)*qwe$se.fit)   # distribution
plot1(); lines(d$YEAR, d$fit.gam, col="blue", lwd=3)
polygon(border=NA, col="#0000FF10",
   c(d$YEAR, rev(d$YEAR)), c(d$lwr.gam, rev(d$upr.gam)))
Additional question: if we go with m_breakp here: What are the slopes of YEAR before and after the breakpoint, i.e. during the ascent of the construction and when it seems to level off? Does it really
‘ascend’ (in the sense that the slope of YEAR before the breakpoint is indeed significantly greater than 0)?
'level off' (in the sense that the slope of YEAR after the breakpoint is indeed not significantly different from 0 anymore)?
In other words, what are the two slopes of YEAR before and after the breakpoint, and are they significantly different from 0? (We know they are significantly different from each other from the model summary.) If this looks like a glht question to you, good for you!
---title: "Ling 204: session 04"author: - name: "[Stefan Th. Gries](https://www.stgries.info)" affiliation: - UC Santa Barbara - JLU Giessen orcid: 0000-0002-6497-3958date: "2026-01-28 12:34:56 PDT"date-format: "DD MMM YYYY HH-mm-ss"editor: sourceformat: html: page-layout: full code-fold: false code-link: true code-copy: true code-tools: true code-line-numbers: true code-overflow: scroll number-sections: true smooth-scroll: true toc: true toc-depth: 4 number-depth: 4 toc-location: left monofont: lucida console tbl-cap-location: top fig-cap-location: bottom fig-width: 4 fig-height: 4 fig-format: png fig-dpi: 300 fig-align: center embed-resources: true link-external-newwindow: trueexecute: cache: false echo: true eval: true warning: false---# Session 04: Curvature {#sec-s04}In nearly all models involving numeric predictors so far, we have implicitly adopted the null hypothesis that the relationship between a numeric predictor and a response variable -- numeric or categorical -- is best, or at least most parsimoniously, captured with a straight line. That would be most parsimonious because it involves the smallest number of parameters: an intercept and a slope rather than also parameters that determine how a line would deviate from a straight line. But for many phenomena a straight line is most likely not the right approach (to put it very mildly). Learning, forgetting, priming, language change, etc. all involve non-linearities that we ideally would try to capture. This session will deal with a few approaches to this:* regression with breakpoints in @sec-s04_regrbreak;* regression involving polynomials in @sec-s04_polynom;* more flexible approach such as splines and generalized additive models in @sec-s04_splinesgams.## Regression w/ breakpoints {#sec-s04_regrbreak}### A simple linear modelLet's take a small data set and fit a simple linear model on it:```{r}rm(list=ls(all.names=TRUE))pkgs2load <-c("car", "effects", "magrittr", "multcomp", "MuMIn")suppressMessages(sapply(pkgs2load, library,character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)summary(d <-data.frame(PR=c(0.98, 0.98, 1.52, 2.14, 2.61, 3.05, 3.65, 4.52, 5.09, 5.71, 5.51, 5.52, 4.92, 4.38,3.77, 3.39, 2.39, 2.39, 1.86, 1.41, 0.94, 1.92, 3.32, 4.16, 5.01, 5.56, 6.28, 6.66,7.26, 8.22, 8.71, 9.31, 8.79, 8.39, 8.39, 7.77, 7.02, 6.78, 6.32, 6.32),RE=c(5.48, 5.48, 4.71, 4.46, 4.11, 4.03, 3.64, 2.55, 1.88, 2.13, 2.27, 2.27, 3.08, 3.08,3.54, 3.78, 4.88, 4.88, 5.31, 5.83, 5.58, 5.23, 4.61, 3.25, 3.11, 3.68, 4.74, 5.66,6.51, 8.21, 9.01, 9.43, 8.83, 8.16, 8.16, 7.07, 5.41, 5.41, 4.63, 4.63)))```We first compute the deviance of the response and the simplest straight-line model:```{r}deviance(m_00 <-lm(RE ~1, data=d))summary(m_straight <-lm(RE ~1+ PR, data=d))```This doesn't look all bad:* the model/predictor is very significant;* *R*^2^ is not all bad;* the residual summary is looking very good with a median very close to 0 and an at least somewhat symmetric distribution.But the model diagnostics are terrible, as we see in @fig-mstraightdiagn:```{r}#| label: fig-mstraightdiagn#| fig-cap: Model-diagnostic plots for m_straightpar(mfrow=c(1,2)) # show theplot(m_straight, which=1:2) # diagnosticpar(mfrow=c(1,1)) # plots```And if we compute the values that this model predicts ...```{r}nd <-data.frame(PR=seq(0, 10, 0.05))nd <-cbind(nd, predict(m_straight, newdata=nd, interval="confidence"))colnames(nd)[2:4] <-c("fit.str", "lwr.str", "upr.str")nd %>% headnd %>% tail```... 
and then plot those together with the observed values into @fig-mstraightobspred, we can see why the diagnostics say the model is terrible:```{r}#| label: fig-mstraightobspred#| fig-cap: Observations against predictions for m_straightplot(d$PR, d$RE, pch=16); grid()lines(x=nd$PR, y=nd$fit.str, lty=2)polygon(col="#00000020", border=NA,c(nd$PR, rev(nd$PR)),c(nd$lwr.str, rev(nd$upr.str)))```Clearly, we need something better, which is what we turn to now.### A better approachGiven @fig-mstraightdiagn and @fig-mstraightobspred, it seems as if there are two trends:* one when `d$PR` is between roughly 1 and 5;* one when `d$PR` is between roughly 5 and 9.5.How would we approach this in a somewhat convoluted way? We might do something like this:1. we split `d` up into two parts depending on whether `d$PR` is greater than 5 or not;2. we compute the deviance of `RE` in each part of `d`;3. we fit a separate `lm` regression line for `RE ~ PR` for each part of `d`;4. we can compute some version of an *R*^2^ based on the summed deviance of the two separate models.Here are steps 1 and 2:```{r}lapply(d.split <-split(d, d$PR>5), summary)"+"(deviance(lm(RE ~1, data=d.split[["FALSE"]])), # 18.0321deviance(lm(RE ~1, data=d.split[["TRUE"]]))) # 120.247```Note how the combined deviance of the two models (138.2791) is already lower than the initial null deviance (152.3095) because we're now using two means for prediction instead of just one for the first null model on all the data. Now we do step 3:```{r}summary(m_PRlt5 <-lm(RE ~1+ PR, data=d.split[["FALSE"]]))summary(m_PRgt5 <-lm(RE ~1+ PR, data=d.split[["TRUE"]]))```Very nice *R*^2^s! And, step 4:```{r}(sum_of_devs <-deviance(m_PRlt5) +deviance(m_PRgt5))```Which means, the two models account for ≈95.3% of the overall null deviance:```{r}(152.3095- sum_of_devs) /152.3095```Also, we can check whether the slopes of `PR` are significantly different from each other by checking that the confidence intervals* don't overlap with each other;* don't include 0:```{r}Confint(m_PRlt5)Confint(m_PRgt5)```Clearly, the slopes are both significantly different from each other and from 0.The better way, the one that also gets us actual *p*-values, would of course involve `multcomp::glht`. The following checks whether each slope is significantly different from the other:```{r}summary(glht( # show the summary of a glht, namely m_PRlt5, # check whether the sum of (these coeffs of m_PRlt5matrix(c(0,1), nrow=1), # multiplied by this)rhs=1.82)) # is significantly different from 1.82summary(glht( # show the summary of a glht, namely m_PRgt5, # check whether the sum of (these coeffs of m_PRgt5matrix(c(0,1), nrow=1), # multiplied by this)rhs=-0.73)) # is significantly different from -0.73```... and this now checks whether the slopes are significantly different from 0:```{r}glht(m_PRlt5, matrix(c(0, 1), nrow=1)) %>% { list("CI"=confint(.)$confint,"P"=summary(.)$test$pvalues[1]) }glht(m_PRgt5, matrix(c(0, 1), nrow=1)) %>% { list("CI"=confint(.)$confint,"P"=summary(.)$test$pvalues[1]) }```But this is not how we should do this for two reasons. The first of these is that we seem to have a situation where we can say 'the effect of `PR` is not the same everywhere', which means **we should not go with two separate models, but one**, namely one that involves an interaction. 
Interestingly, the interaction here is kind of like a numeric variable interacting with a binary version of itself!#### The interactionThus, we start by defining a variable `PR_bin` and add it to the data frame `d`:```{r}if(!any(d$PR==5)) { breakpoint <-5 }d$PR_bin <-factor(ifelse(d$PR>breakpoint, "gtbp", "ltbp"),levels=c("ltbp", "gtbp"))summary(d <- d[,c(1,3,2)])```Then we fit a new model that includes the 'old' numeric predictor but now also its binary version ,as well as the interaction of the two and plot its diagnostics:```{r}#| label: fig-mbreakp5diagn#| fig-cap: Model-diagnostic plots for m_breakp w/ a breakpoint of 5summary(m_breakp <-lm( # fit and RE ~1+# summarize the PR + PR_bin + PR:PR_bin, # model w/ thedata=d)) # new predictorsdeviance(m_breakp) # same as sum_of_devspar(mfrow=c(1,2)) # show theplot(m_breakp, which=1:2) # diagnosticpar(mfrow=c(1,1)) # plots```See how well this model does? Look at both the range of *x*-axis values of the first diagnostic plot and the relative lack of structure in it!) It has the same deviance as what before computed for `sum_of_devs` from the deviances of `m_PRlt5` and `m_PRgt5`, a very high multiple *R*^2^, and the interaction is significant, which shows that* the slope of `PR` in `m_PRlt5` (`r coef(m_PRlt5)[2]`) and* the slope of `PR` in `m_PRgt5` (`r coef(m_PRgt5)[2]`)are indeed significantly different. Plus, obviously, the diagnostics of this single model now look much better than they did for `m_straight`, which becomes clearest if we compare the fitted-vs-.-residuals plots of both models in one panel like in @fig-diagn4straightVSbreakpoints:```{r}#| label: fig-diagn4straightVSbreakpoints#| fig-cap: Model-diagnostic plots for m_straight vs. m_breakpplot(pch=16, col="#00000030",xlab="Fitted values", xlim=c( 3, 7) , x=predict(m_straight),ylab="Residuals" , ylim=c(-3.5, 3.5), y=residuals(m_straight)); grid()points(x=predict(m_breakp), y=residuals(m_breakp), pch=16)text(6 , 2.5, "m_straight", col="#00000030")text(6.5, -0.5, "m_breakp")```How might we plot this best? The usually best strategy -- go with `effects` -- isn't too great here:```{r}plot(effects::allEffects(m_breakp, grid=TRUE))```We want something better. For that, we go back to our data frame `nd` that so far only contains a sequence of values covering the predictor `PR` and the predictions of `m_straight` and1. add a new column that states for each value in that sequence which value of `PR_bin` it would come with;2. 
add new columns with the predictions of our model `m_breakp` for these pairs of values and their lower and upper 95% CI:```{r}# step 1:nd$PR_bin <-c("ltbp", "gtbp")[as.numeric(nd$PR>=breakpoint)+1]; nd <- nd[,c(1,5,2:4)]# step 2:nd <-cbind(nd, predict(m_breakp, newdata=nd, interval="confidence"))colnames(nd)[6:8] <-c("fit.brk", "lwr.brk", "upr.brk")nd[c(1:2, 99:102, 199:200),]```Then we plot this into @fig-mbreakat5; the result looks much better than the effects plot:```{r}#| label: fig-mbreakat5#| fig-cap: Observations against predictions for m_breakp w/ a breakpoint of 5# observed dataplot(pch=16, col=ifelse(d$PR_bin=="ltbp", "#FF00FFFF", "#00FF00FF"),x=d$PR, y=d$RE); grid(); abline(v=5, lty=3)# predicted valuespoints(x=nd$PR, y=nd$fit.brk, pch=16, cex=0.5,col=ifelse(nd$PR_bin=="ltbp", "#FF00FFFF", "#00FF00FF"))with(nd[nd$PR_bin=="ltbp",],polygon(border=NA, col="#FF00FF20",c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))with(nd[nd$PR_bin=="gtbp",],polygon(border=NA, col="#00FF0020",c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))text(5, 7, "breakpoint", srt=90, pos=2)```#### The breakpointBut this addressed 'only' the first problem -- the second is, **how do we even know the best breakpoint is indeed 5?** We just went with a visual heuristic but didn't really explore this properly, which is what we will do now. For this, we could do the following:1. we determine all possible breakpoints at which `d$PR` could be split into two values;2. for each possible breakpoint, we fit a version of `m_breakp` and collect its deviance;3. the best breakpoint is the model with the smallest deviance, which is the 'final model' we would then adopt.Here's step 1: We determine all unique values of `d$PR` and the possible breakpoints between them:```{r}unique_PR_values <- d$PR %>% unique %>% sort # sort unique values)possible_breakpoints <-setNames( # make possible_breakpoints# the middles between each value & the next (unique_PR_values[-1] + unique_PR_values[-length(unique_PR_values)]) /2,# & assign as names the lower and the upper values of each pair:paste(unique_PR_values[1:35], unique_PR_values[2:36], sep="-"))possible_breakpoints```We set up a collector vector for the deviances:```{r}possible_deviances <-setNames(rep(NA, length(possible_breakpoints)), possible_breakpoints)```Then step 2:```{r}for (i inseq(possible_breakpoints)) { # loop over the breakpoints d$PR_bin <-factor(ifelse( # create a binary predictor at breakpoint d$PR>possible_breakpoints[i], "gtbp", "ltbp")) # w/ these levels# collect the deviance when that predictor is used in the model: possible_deviances[i] <-lm(RE ~1+ PR*PR_bin, data=d) %>% deviance}```Step 3: what's the best breakpoint? 
The plot shows that our guesstimate of 5 really wasn't all that bad, yet still not optimal:```{r}#| label: fig-choosebestbreakp#| fig-cap: Deviances against breakpoint selectionsbestbreakpoint <- possible_breakpoints[which(possible_deviances==min(possible_deviances))]plot(type="l",xlim=c(0, 10), x=possible_breakpoints,ylim=c(0, 150), y=possible_deviances); grid()arrows( bestbreakpoint, 100, bestbreakpoint, 10)text(bestbreakpoint, 100, bestbreakpoint, pos=3)arrows(5, 50, 5, 10)text(5, 50, 5, pos=3)```So now we fit our model again, but now with this best breakpoint of `r bestbreakpoint` and we plot diagnostics of it again, too:```{r}#| label: fig-mbestbreakp6diagn#| fig-cap: Model-diagnostic plots for `m_breakp` w/ the best breakpoint of 6#| fig-width: 7d$PR_bin <-factor(ifelse(d$PR>bestbreakpoint, ">bp", "<bp"))summary(m_bestbreakp <-lm( # fit and RE ~1+# summarize the PR + PR_bin + PR:PR_bin, # model w/ thedata=d)) # new predictorsdeviance(m_bestbreakp)par(mfrow=c(1,2)) # show theplot(m_bestbreakp, which=1:2) # diagnosticpar(mfrow=c(1,1)) # plots```The *R*^2^-value of this model with the best breakpoint is a bit higher than the one with our initial heuristic breakpoint, and of course this model is significantly better than the straight-line model:```{r}anova(m_straight, m_breakp , test="F")anova(m_straight, m_bestbreakp, test="F")c("AIC of m_straight"=AIC(m_straight), "AIC of m_breakp"=AIC(m_breakp), "AIC of m_bestbreakp"=AIC(m_bestbreakp))```So let's generate a new plot, for which we first compute the new predictions ...```{r}# step 1: (again)nd$PR_bin <-c("<bp", ">bp")[as.numeric(nd$PR>=bestbreakpoint)+1]# step 2 (again):nd[6:8] <-predict(m_bestbreakp, newdata=nd, interval="confidence")nd[c(1:2, 99:102, 199:200),]```Then we plot this in the same way as before. (Since we will add more info to this plot multiple times later, I put the plotting code into a function to save 'code space' below.)```{r}#| label: fig-mbestbreakat6#| fig-cap: Observations against predictions (with heuristic breakpoint)plot1 <-function () {# observed dataplot(pch=16, col=ifelse(d$PR_bin=="<bp", "#FF00FFFF", "#00FF00FF"),x=d$PR, y=d$RE); grid(); abline(v=bestbreakpoint, lty=3)abline(m_straight, lty=2) # the straight-line regression model# predicted valuespoints(x=nd$PR, y=nd$fit.brk, pch=16, cex=0.5,col=ifelse(nd$PR_bin=="<bp", "#FF00FFFF", "#00FF00FF"))with(nd[nd$PR_bin=="<bp",],polygon(border=NA, col="#FF00FF20",c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))with(nd[nd$PR_bin==">bp",],polygon(border=NA, col="#00FF0020",c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))}; plot1()```Not a huge difference to the breakpoint model were we just assumed the breakpoint was 5 -- probably none when it comes to interpretation, in fact -- but this is 'objectively' the best model (with a breakpoint) we could get for this data.## Polynomials {#sec-s04_polynom}### A first example {#sec-s04_polynom_dur}One possible alternative would be to not posit a clear breakpoint that separates two straight lines but to assume one line with some amount of curvature. 
The maybe simplest way to do this would be to use **polynomial regression** kind of approach, in which the numeric predictor is entered in the rhs as a polynomial to the *n*-th degree, with *n*-1 being, in a sense, the number of different curves/turns.The straight-line model we fit above is this:```{r}summary(m_straight)deviance(m_straight)```A model with a polynomial allowing for one curve could be written like this:```{r}summary(m_polyorth <-lm( RE ~1+poly(PR, 2), # <---data=d))deviance(m_polyorth)```Certainly a massive improvement over the straight-line model! That means the curvature pretty much *has* to be significant, and it is; note: the *F*-value of the anova comparison is the square of the *t*-value for `poly(PR, 2)2` in the summary output.```{r}anova(m_straight, m_polyorth, test="F")c("AIC of m_straight"=AIC(m_straight), "AIC of m_polyorth"=AIC(m_polyorth))```What do the diagnostics say? They are not as good as those of `m_bestbreakp`, but certainly much better than the straight-line version `m_straight`:```{r}#| label: fig-mpolyorthdiagn1#| fig-cap: Model-diagnostic plots for `m_polyorth`#| fig-width: 7par(mfrow=c(1,2)) # show theplot(m_polyorth, which=1:2) # diagnosticpar(mfrow=c(1,1)) # plots```What does the fit look like? @fig-mpolyorthdiagn2 indicates it's quite good though possibly not as good as the model with a breakpoint:```{r}#| label: fig-mpolyorthdiagn2#| fig-cap: Observations against predictions (`m_straight` & `m_poly`)nd <-cbind(nd, predict(m_polyorth, newdata=nd, interval="confidence"))colnames(nd)[9:11] <-c("fit.poly", "lwr.poly", "upr.poly")plot2 <-function () {plot1()lines(x=nd$PR, y=nd$fit.poly, col="red", lty=2)polygon(col="#FF000020", border=NA,c(nd$PR, rev(nd$PR)),c(nd$lwr.poly, rev(nd$upr.poly)))}; plot2()```How could we check whether the model with the best breakpoint `m_bestbreakp` is indeed better than the polynomial `m_polyorth`? Because we can't use `anova` ...```{r}#| results: holdc("AIC of m_bestbreakp"=AIC(m_bestbreakp), "AIC of m_polyorth"=AIC(m_polyorth))```*AIC* shows that the model with the best breakpoint is *much* better than the polynomial!But (i) how do we interpret the coefficients in the summary output of `m_polyorth`? And (ii) how does the polynomial does what it does? The short answer to (i) is, you don't. The normal kind of polynomial coefficients we created here is pretty much uninterpretable and interpretation of the results is based on the visualization of the predictions as we have done before. This is because of the answer to (ii). Here are the model matrices for both models; the one for `m_straight`, fine, but the one for `m_polyorth` is not intuitive at all.```{r}model.matrix(m_straight) %>% headmodel.matrix(m_polyorth) %>% head```One reason why it's unintuitive is that the default polynomials in R are orthogonal polynomials, with *orthogonal* meaning they are transformed such that they end up not being linearly correlated with each other (which should remind you of collinearity):```{r}cor(model.matrix(m_polyorth)[,2],model.matrix(m_polyorth)[,3]) %>%round(4)```But to appreciate the logic of polynomials a bit more, we can look at a non-orthogonalized version of these polynomials, which is a bit easier to comprehend; we define polynomials again but now use the additional argument `raw=TRUE`:```{r}summary(m_polyraw <-lm( RE ~1+poly(PR, 2, raw=TRUE), # <---data=d))deviance(m_polyraw)```You can see that everything stays the same -- *R*^2^, *F*, residuals, ... 
-- so in that sense the model `m_polyraw` is equivalent to the model `m_polyorth`, but now check out the model matrix:```{r}model.matrix(m_straight) %>% headmodel.matrix(m_polyraw) %>% head```So the raw version of a 2nd-degree polynomial of a numeric predictor is simply* using one column for the original predictor (column 2 of the model matrix of `m_polyraw`);* using another column for what is the square of the original predictor (column 3 of the model matrix of `m_polyraw`), this is what makes the line curved.But since these columns are of course extremely highly correlated, ...```{r}cor(model.matrix(m_polyraw)[,2],model.matrix(m_polyraw)[,3]) %>%round(4)```... the default way in which `poly` proceeds is to orthogonalize these two columns, which we have seen eliminates the high correlation between the two columns in the model matrix, but also is what makes the coefficients uninterpretable.One natural follow-up question now is when to use regression with breakpoints and when a polynomial approach. Sometimes, you might be able to decide on the basis of fit and we have seen that the fit of the model with a breakpoint is much better than the polynomial model. However, sometimes you might also decide on the basis of the conceptual logic and hypotheses that underpin the analysis (maybe a breakpoint's location follows perfectly well from previous work or it is easier, or harder, to motivate a single curved causal mechanism) -- in other words, 'it depends'.### A more fine-grained example {#sec-s04_polynom_rt}Let's look at this with a simpler example, but in more detail. We take the reaction time data from the book, where the expectations were that* `FAM` would be negatively correlated with `RT`: the more familiar, the faster one reacts to the word;* `FREQ` would be negatively correlated with `RT`: the more frequent, the faster one reacts to the word.We first log `FREQ`, ...```{r}#| code-fold: showsummary(x <-read.delim( # summarize d, the result of loadingfile="_input/reactiontimes.csv", # this filestringsAsFactors=TRUE)) # change categorical variables into factorsx <-droplevels(x[complete.cases(x[,c("RT", "FREQ", "FAM")]),])x$FREQ <-log2(x$FREQ)```... and then apply a backwards model selection process to the initial model `RT ~ poly(FREQ, 2) * FAM` to identify a final model. Here's the initial model:```{r}summary(m_01 <-lm(RT ~1+poly(FREQ, 2) * FAM, data=x))head(mm_01 <-model.matrix(m_01))drop1(m_01, test="F") %>% data.frame```Given this result of `drop1`, you are likely tempted to do this next:```{r}summary(m_02a <-update(m_01, .~.-poly(FREQ, 2):FAM)) # remove what drop1 said is nsanova(m_01, m_02a, test="F") %>% data.frame # same as drop1```But note that the ns term returned by `drop1` contains 2 'elements', so to speak:* we went with keeping the curved effect but removing the interaction,* which means we lost 4 (!) columns in the model matrix:```{r}head(mm_02a <-model.matrix(m_02a))```So what about removing the curvature while still keeping the interaction (w/ `FREQ` as a straight line) in there?```{r}summary(m_02b <-update(m_01, .~.-poly(FREQ, 2)*FAM # remove the curvature+ FREQ*FAM)) # keep the interaction FAM w/ FREQ as a straight linehead(mm_02b <-model.matrix(m_02b))```You can see from the results and the model matrices that `m_02a` and `m_02b` are not identical. 
So let's compare things:```{r}c("m_01"=AICc(m_01), "m_02a"=AICc(m_02a), "m_0b"=AICc(m_02b))anova(m_01, m_02a, test="F") %>% data.frame # rep for convenienceanova(m_01, m_02b, test="F") %>% data.frame```Given these results, we go with `m_02a` because it is 'less significantly different' from `m_01` and has a lower *AICc*-value than `m_01` and `m_02b`. We continue:```{r}drop1(m_02a, test="F") %>% data.frame```Given this result of `drop1`, you are likely tempted to do this next:```{r}summary(m_03 <-update(m_02a, .~.-poly(FREQ, 2))) # remove what drop1 said is nsanova(m_02a, m_03, test="F") %>% data.frame # same as drop1```But note that the ns term returned by `drop1` contains 2 'elements', so to speak:* we dropped a linear component of frequency (column 2 of `mm_02a`), and* we dropped a squared component of frequency (column 3 of `mm_02a`),* which means we lost 2 (!) columns in the model matrix:```{r}head(mm_03 <-model.matrix(m_03))```Let's test each. Here are the models and their model matrices ...```{r}summary(m_03a <-update(m_02a, .~.-poly(FREQ, 2) # remove what drop1 said is ns+ FREQ)) # keep FREQ as a straight linehead(mm_03a <-model.matrix(m_03a))summary(m_03b <-update(m_02a, .~.-poly(FREQ, 2) # remove what drop1 said is ns+I(FREQ^2))) # keep FREQ squaredhead(mm_03b <-model.matrix(m_03b))```... and then we compare things again:```{r}c("m_02a"=AICc(m_02a), "m_03a"=AICc(m_03a), "m_03b"=AICc(m_03b))anova(m_02a, m_03a, test="F") %>% data.frameanova(m_02a, m_03b, test="F") %>% data.frame```We go with `m_03a` because it is not significantly different from `m_02a` and has a lower *AICc*-value than the other models:```{r}drop1(m_03a, test="F") %>% data.frame```And now we're done because ...? We said we expected that "`FREQ` would be negatively correlated with `RT`", which it is, which means we can use a one-tailed *p*-value: instead of `r drop1(m_03a, test="F") %>% data.frame %>% "["(3,6)` we get `r drop1(m_03a, test="F") %>% data.frame %>% "["(3,6) %>% "/"(2)`, which is significant.```{r}summary(m_03a)```Bottom line: You may not always have to go through all the lengths shown here, but be aware of what you're dropping/doing! Look, this is how complex it can get:```{r}summary(model_eg <-lm(RT ~1+poly(MEAN, 2, raw=TRUE) *poly(FREQ, 2, raw=TRUE), data=x))$coefficientshead(model.matrix(model_eg))```Here's what the columns are:* column 1: intercept* column 2: `MEAN`* column 3: `MEAN`^2* column 4: `FREQ`* column 5: `FREQ`^2* column 6: `MEAN` * `FRE`Q* column 7: `MEAN`^2 * `FREQ`* column 8: `MEAN` * `FREQ`^2* column 9: `MEAN`^2 * `FREQ`^2## Splines and/or GAMs {#sec-s04_splinesgams}There are other, more sophisticated ways to model curvature. Polynomials are a good first shot and I nearly always try them first but they are sometimes not sensitive enough. 
One alternative is what are called **splines** (usually "natural cubic splines"), and in the simplest case they can be fit fairly easily like this:```{r}library(splines)summary(m_spline_2 <-lm( RE ~1+ns(PR, knots=2),data=d))```While the coefficients are again uninterpretable, this seems very similar to the quality of the polynomial model, but the polynomial is in fact a bit better, which is confirmed by the *AIC*-values as well:```{r}AIC(m_spline_2)AIC(m_polyorth)```Let's compare the predictions of both; in @fig-splines, we add the new result to all previous ones:```{r}#| label: fig-splines#| fig-cap: Observations against predictions (`m_straight`, `m_bestbreakpoint`, `m_poly`, `m_splines`)nd <-cbind(nd, predict(m_spline_2, newdata=nd, interval="confidence"))colnames(nd)[12:14] <-c("fit.spli", "lwr.spli", "upr.spli")plot3 <-function () {plot2()lines(nd$PR, nd$fit.spli, col="blue")polygon(col="#0000FF20", border=NA,c(nd$PR, rev(nd$PR)),c(nd$lwr.spli, rev(nd$upr.spli)))}; plot3()```For the sake of interpreting this, I am not sure the statistical difference here between the polynomial and the splines could be important conceptually, theoretically, or in any applied sense.The final way to look at curvature is the most powerful but also the by far most complex one involves a method called **generalized additive models** (gams). On the surface, using this method is deceptively simple; using the package `mgcv`, the approach looks very much like the splines one:```{r}#| message: falselibrary(mgcv)summary(m_gam <-gam( RE ~1+s(PR), data=d))```Variance explanation is excellent: *R*^2^ is greater than 0.95, but *R*^2^ of `m_bestbreakp` was a tiny bit better -- *AIC* has a slight preference for `m_gam`:```{r}AIC(m_breakp)AIC(m_gam)```Also, we can also see that the smooth term -- the thing responsible for the curvature -- is highly significant and involves a degree of curvature that is equivalent to 6 knots (*edf*, for 'empirical degrees of freedom', the squiggliness the gam estimated from the data). But as before, these numerical results don't tell us anything about interpretation, so we plot again, this time into @fig-gam (and without a function now because this will be the last plot):```{r}#| label: fig-gam#| fig-cap: Observations against predictions (all models)nd <-cbind(nd, predict(m_gam, newdata=nd))colnames(nd)[15] <-"fit.gam"plot3(); lines(nd$PR, nd$fit.gam, col="darkgreen", lwd=3)```The results of the gam clearly support the regression with breakpoints -- although, the big turning point of the gam is again closer to `PR`=5 ... -- and I leave it up to you to verify that diagnostics such as predicted-vs.-fitted look fine. You might therefore of course wonder why one wouldn't always use a gam -- the answer is what I alluded to above: gams are indeed extremely powerful and in the best of worlds we'd use them much more often, but the downside is that the complexity that comes with them makes them often really hard to apply. 
And in this case at least, the gam is not consistently better than the regression with the best breakpoint (and any interpretation of the gam's result would probably be identical to that of the model with the best breakpoint).

## Exercises

### Exercise 1

Take the duration data, log `DURATION` as before, square-root `SURPRISAL` as before, ...

```{r}
#| code-fold: show
rm(list=ls(all.names=TRUE))
pkgs2load <- c("car", "effects", "magrittr", "multcomp", "MuMIn")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
summary(d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE))
d$DURATION.l <- log2(d$DURATION)
d$SURPRISAL.s <- sqrt(d$SURPRISAL)
```

... and then determine whether the relationship between `DURATION.l` and `SURPRISAL.s` is better captured with a straight line or a second-degree polynomial.

```{r}
plot(pch=16, col="#00000020",
   x=jitter(d$SURPRISAL.s),
   y=d$DURATION.l); grid()
summary(m_straight <- lm(DURATION.l ~ 1 + SURPRISAL.s, data=d))
abline(m_straight, col="red")
# show the diagnostic plots:
par(mfrow=c(1,2))
plot(m_straight, which=1:2)
par(mfrow=c(1,1))
summary(m_polyorth <- lm(
   DURATION.l ~ 1 + poly(SURPRISAL.s, 2),
   data=d))
m_polyraw <- lm(DURATION.l ~ 1 + poly(SURPRISAL.s, 2, raw=TRUE), data=d)
anova(m_straight, m_polyorth, test="F") # same as anova(m_straight, m_polyraw, test="F")
AIC(m_straight)
AIC(m_polyorth) # rel lik: 6.463901e+40
head(model.matrix(m_straight)) # recap
head(model.matrix(m_polyorth)) # recap
head(model.matrix(m_polyraw))  # it undoes our transformation ;-)
nd <- data.frame(SURPRISAL.s=seq(
   min(d$SURPRISAL.s),
   max(d$SURPRISAL.s),
   length.out=10))
nd <- cbind(nd, predict(
   m_polyraw, # or m_polyorth
   newdata=nd,
   interval="confidence"))
plot(pch=16, col="#00000020",
   x=jitter(d$SURPRISAL.s),
   y=d$DURATION.l)
grid(); abline(m_straight, col="red")
lines(nd$SURPRISAL.s, nd$fit, col="#00FF00FF", lwd=2)
polygon(c(nd$SURPRISAL.s, rev(nd$SURPRISAL.s)),
   c(nd$lwr, rev(nd$upr)),
   col="#00FF0020", border=NA)
```

### Exercise 2

Hilpert (2011) discusses the change in the frequency of the *keep* V-*ing* construction in 18 years of the TIME Magazine corpus; the data are in [_input/keepVing.csv](_input/keepVing.csv):

```{r}
rm(list=ls(all.names=TRUE))
pkgs2load <- c("car", "effects", "emmeans", "magrittr", "multcomp")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
summary(d <- read.delim("_input/keepVing.csv", stringsAsFactors=TRUE))
```

You can see what the variables/columns contain in the file [_input/keepVing.r](_input/keepVing.r).
Fit a nice regression to model the normalized frequency of the construction over time.

```{r}
plot(pch=16, col="#00000020",
   x=jitter(d$YEAR),
   y=d$TOKpmw); grid()
summary(m_straight <- lm(TOKpmw ~ 1 + YEAR, data=d))
abline(m_straight, col="red")
# show the diagnostic plots:
par(mfrow=c(1,2))
plot(m_straight, which=1:2)
par(mfrow=c(1,1))

# regression with breakpoints
possible_breakpoints <- setNames( # make possible_breakpoints
   # the middles between each value & the next
   (d$YEAR[-1] + d$YEAR[-length(d$YEAR)]) / 2,
   # & assign as names the lower and the upper values of each pair:
   paste(d$YEAR[1:17], d$YEAR[2:18], sep="-"))
possible_deviances <- setNames(rep(NA, length(possible_breakpoints)), possible_breakpoints)
for (i in seq(possible_breakpoints)) { # loop over the breakpoints
   d$YEAR_bin <- factor(ifelse(        # create a binary predictor at each breakpoint
      d$YEAR > possible_breakpoints[i], "gtbp", "ltbp")) # w/ these levels
   # collect the deviance when that predictor is used in the model:
   possible_deviances[i] <- lm(TOKpmw ~ 1 + YEAR*YEAR_bin, data=d) %>% deviance
}
sort(possible_deviances)
breakpoint <- 1996.5 # I am not picking the best one here -- why?
plot(type="l",
   xlim=c(1990, 2007), x=possible_breakpoints,
   ylim=c(0, 350),     y=possible_deviances); grid()
arrows(breakpoint, 250, breakpoint, possible_deviances[as.character(breakpoint)])
text(breakpoint, 250, breakpoint, pos=3)
d$YEAR_bin <- factor(ifelse(d$YEAR>breakpoint, "gtbp", "ltbp"), levels=c("ltbp", "gtbp"))
summary(m_breakp <- lm(             # fit and
   TOKpmw ~ 1 +                     # summarize the
   YEAR + YEAR_bin + YEAR:YEAR_bin, # model w/ the
   data=d))                         # new predictors
# very small loss of R-squared for the not quite optimal breakpoint
par(mfrow=c(1,2))         # show the
plot(m_breakp, which=1:2) # diagnostic
par(mfrow=c(1,1))         # plots
# still not perfect (but this is 18 data points ...)
anova(m_straight, m_breakp, test="F")
AIC(m_straight); AIC(m_breakp)
# significantly better than the straight line, of course
d <- cbind(d, predict(m_breakp, newdata=d, interval="confidence"))
colnames(d)[6:8] <- c("fit.brk", "lwr.brk", "upr.brk")
plot1 <- function() {
   # observed data
   plot(pch=16, col=ifelse(d$YEAR_bin=="ltbp", "#FF00FFFF", "#00FF00FF"),
      x=d$YEAR, y=d$TOKpmw); grid(); abline(v=breakpoint, lty=3)
   abline(m_straight, lty=2) # the straight-line regression model
   # predicted values
   lines(x=d$YEAR[d$YEAR_bin=="ltbp"], y=d$fit.brk[d$YEAR_bin=="ltbp"],
      pch=16, cex=0.5, col="#FF00FFFF")
   with(d[d$YEAR_bin=="ltbp",],
      polygon(border=NA, col="#FF00FF30",
         c(YEAR, rev(YEAR)), c(lwr.brk, rev(upr.brk))))
   lines(x=d$YEAR[d$YEAR_bin=="gtbp"], y=d$fit.brk[d$YEAR_bin=="gtbp"],
      pch=16, cex=0.5, col="#00FF00FF")
   with(d[d$YEAR_bin=="gtbp",],
      polygon(border=NA, col="#00FF0030",
         c(YEAR, rev(YEAR)), c(lwr.brk, rev(upr.brk))))
}; plot1()

# spline
library(splines)
summary(m_spline <- lm(
   TOKpmw ~ 1 + ns(YEAR, knots=2),
   data=d))
AIC(m_breakp); AIC(m_spline) # spline is much worse

# gam
library(mgcv)
summary(m_gam <- gam(
   TOKpmw ~ 1 + s(YEAR), data=d))
AIC(m_breakp); AIC(m_gam) # gam is better, but might be too fine-grained!
# note how I compute the confidence interval for the gam predictions:
qwe <- predict(m_gam, newdata=d, se.fit=TRUE) # w/ std. errors ...
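# predict.gam(..., se.fit=TRUE) returns the predictions ($fit) together with their
# standard errors ($se.fit); the next lines turn those into an approximate 95%
# confidence band as fit + qnorm(0.025)*se and fit + qnorm(0.975)*se,
# i.e. the fit minus/plus roughly 1.96 standard errors: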
d <- cbind(d, "fit.gam"=qwe$fit,              # ... from the
   "lwr.gam"=qwe$fit+qnorm(0.025)*qwe$se.fit, # normal
   "upr.gam"=qwe$fit+qnorm(0.975)*qwe$se.fit) # distribution
plot1(); lines(d$YEAR, d$fit.gam, col="blue", lwd=3)
polygon(border=NA, col="#0000FF10",
   c(d$YEAR, rev(d$YEAR)), c(d$lwr.gam, rev(d$upr.gam)))
```

```{r}
#| echo: false
#| eval: false
# I am not picking the lowest possible deviance anymore because, if one chooses the best
# deviance, then one gets two very disjointed regression lines, each with a positive slope
# (although only the 'before' one is then significantly different from 0).
```

Additional question: if we go with `m_breakp` here, what are the slopes of `YEAR` before and after the breakpoint, i.e. during the ascent of the construction and when it seems to level off? Does it really

* 'ascend' (in the sense that the slope of `YEAR` before the breakpoint is indeed significantly greater than 0)?
* 'level off' (in the sense that the slope of `YEAR` after the breakpoint is indeed not significantly different from 0 anymore)?

In other words, what are the two slopes of `YEAR` before and after the breakpoint, and are they significantly different from 0? (We know they are significantly different from each other from the model summary.) If this looks like a `glht` question to you, good for you!

```{r}
#| echo: false
#| eval: false
library(multcomp)
before <- glht(m_breakp, matrix(c(0, 1, 0, 0), nrow=1))
(before <- setNames(
   c(unlist(summary(before)$test[c(3,6)]), confint(before)$confint[2:3]),
   c("slope", "p", "lower", "upper")))
after <- glht(m_breakp, matrix(c(0, 1, 0, 1), nrow=1))
(after <- setNames(
   c(unlist(summary(after)$test[c(3,6)]), confint(after)$confint[2:3]),
   c("slope", "p", "lower", "upper")))
```

# Session info

```{r}
sessionInfo()
```