Ling 204: session 04

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

28 Jan 2026, 11:34:56

1 Session 04: Curvature

In nearly all models involving numeric predictors so far, we have implicitly adopted the null hypothesis that the relationship between a numeric predictor and a response variable – numeric or categorical – is best, or at least most parsimoniously, captured with a straight line. That is most parsimonious because it involves the smallest number of parameters: just an intercept and a slope, rather than additional parameters that determine how a line deviates from a straight one. But for many phenomena a straight line is most likely not the right approach (to put it very mildly). Learning, forgetting, priming, language change, etc. all involve non-linearities that we ideally would try to capture. This session will deal with a few approaches to this:

  • regression with breakpoints in Section 1.1;
  • regression involving polynomials in Section 1.2;
  • more flexible approaches such as splines and generalized additive models in Section 1.3.

1.1 Regression w/ breakpoints

1.1.1 A simple linear model

Let’s take a small data set and fit a simple linear model on it:

rm(list=ls(all.names=TRUE))
pkgs2load <- c("car", "effects", "magrittr", "multcomp", "MuMIn")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
     car  effects magrittr multcomp    MuMIn
    TRUE     TRUE     TRUE     TRUE     TRUE 
summary(d <- data.frame(
   PR=c(0.98, 0.98, 1.52, 2.14, 2.61, 3.05, 3.65, 4.52, 5.09, 5.71, 5.51, 5.52, 4.92, 4.38,
        3.77, 3.39, 2.39, 2.39, 1.86, 1.41, 0.94, 1.92, 3.32, 4.16, 5.01, 5.56, 6.28, 6.66,
        7.26, 8.22, 8.71, 9.31, 8.79, 8.39, 8.39, 7.77, 7.02, 6.78, 6.32, 6.32),
   RE=c(5.48, 5.48, 4.71, 4.46, 4.11, 4.03, 3.64, 2.55, 1.88, 2.13, 2.27, 2.27, 3.08, 3.08,
        3.54, 3.78, 4.88, 4.88, 5.31, 5.83, 5.58, 5.23, 4.61, 3.25, 3.11, 3.68, 4.74, 5.66,
        6.51, 8.21, 9.01, 9.43, 8.83, 8.16, 8.16, 7.07, 5.41, 5.41, 4.63, 4.63)))
       PR              RE
 Min.   :0.940   Min.   :1.880
 1st Qu.:2.555   1st Qu.:3.615
 Median :4.965   Median :4.725
 Mean   :4.823   Mean   :4.968
 3rd Qu.:6.690   3rd Qu.:5.600
 Max.   :9.310   Max.   :9.430  

We first compute the deviance of the response and the simplest straight-line model:

deviance(m_00 <- lm(RE ~ 1, data=d))
[1] 152.3095
summary(m_straight <- lm(RE ~ 1 + PR, data=d))

Call:
lm(formula = RE ~ 1 + PR, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1881 -1.1411 -0.0121  1.5548  2.7758

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.1550     0.6128   5.149 8.33e-06 ***
PR            0.3758     0.1132   3.322  0.00199 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.762 on 38 degrees of freedom
Multiple R-squared:  0.225, Adjusted R-squared:  0.2046
F-statistic: 11.03 on 1 and 38 DF,  p-value: 0.001986

This doesn’t look all bad:

  • the model/predictor is very significant;
  • R2 is not all bad;
  • the residual summary is looking very good with a median very close to 0 and an at least somewhat symmetric distribution.

But the model diagnostics are terrible, as we see in Figure 1:

par(mfrow=c(1,2))           # show the
plot(m_straight, which=1:2) # diagnostic
par(mfrow=c(1,1))           # plots
Figure 1: Model-diagnostic plots for m_straight

And if we compute the values that this model predicts …

nd <- data.frame(PR=seq(0, 10, 0.05))
nd <- cbind(nd, predict(m_straight, newdata=nd, interval="confidence"))
colnames(nd)[2:4] <- c("fit.str", "lwr.str", "upr.str")
nd %>% head
    PR  fit.str  lwr.str  upr.str
1 0.00 3.155040 1.914563 4.395517
2 0.05 3.173833 1.943545 4.404120
3 0.10 3.192625 1.972504 4.412746
4 0.15 3.211417 2.001440 4.421394
5 0.20 3.230210 2.030353 4.430066
6 0.25 3.249002 2.059241 4.438762
nd %>% tail
       PR  fit.str  lwr.str  upr.str
196  9.75 6.819548 5.557808 8.081288
197  9.80 6.838340 5.566345 8.110335
198  9.85 6.857133 5.574862 8.139403
199  9.90 6.875925 5.583359 8.168490
200  9.95 6.894717 5.591837 8.197597
201 10.00 6.913510 5.600297 8.226723

… and then plot those together with the observed values into Figure 2, we can see why the diagnostics say the model is terrible:

plot(d$PR, d$RE, pch=16); grid()
lines(x=nd$PR, y=nd$fit.str, lty=2)
polygon(col="#00000020", border=NA,
   c(nd$PR, rev(nd$PR)),
   c(nd$lwr.str, rev(nd$upr.str)))
Figure 2: Observations against predictions for m_straight

Clearly, we need something better, which is what we turn to now.

1.1.2 A better approach

Given Figure 1 and Figure 2, it seems as if there are two trends:

  • one when d$PR is between roughly 1 and 5;
  • one when d$PR is between roughly 5 and 9.5.

How would we approach this in a somewhat convoluted way? We might do something like this:

  1. we split d up into two parts depending on whether d$PR is greater than 5 or not;
  2. we compute the deviance of RE in each part of d;
  3. we fit a separate lm regression line for RE ~ PR for each part of d;
  4. we can compute some version of an R2 based on the summed deviance of the two separate models.

Here are steps 1 and 2:

lapply(d.split <- split(d, d$PR>5), summary)
$`FALSE`
       PR              RE
 Min.   :0.940   Min.   :2.550
 1st Qu.:1.775   1st Qu.:3.615
 Median :2.500   Median :4.535
 Mean   :2.715   Mean   :4.375
 3rd Qu.:3.680   3rd Qu.:5.250
 Max.   :4.920   Max.   :5.830

$`TRUE`
       PR              RE
 Min.   :5.010   Min.   :1.880
 1st Qu.:5.673   1st Qu.:3.538
 Median :6.720   Median :5.410
 Mean   :6.931   Mean   :5.560
 3rd Qu.:8.262   3rd Qu.:8.160
 Max.   :9.310   Max.   :9.430  
"+"(
   deviance(lm(RE ~ 1, data=d.split[["FALSE"]])), # 18.0321
   deviance(lm(RE ~ 1, data=d.split[["TRUE"]])))  # 120.247
[1] 138.2791
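
As a quick aside on what deviance means here: for an intercept-only lm, the deviance is just the sum of squared deviations from the mean, which can be verified with a made-up toy vector y (a minimal sketch, not the data in d):

```r
# deviance of an intercept-only model = sum of squared deviations
# from the mean (y is a made-up toy vector):
y <- c(2, 4, 6, 9)
deviance(lm(y ~ 1))  # 26.75
sum((y - mean(y))^2) # 26.75, i.e. the same value
```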

Note how the combined deviance of the two models (138.2791) is already lower than the initial null deviance (152.3095), because we're now using two means for prediction instead of the single mean the first null model used for all the data. Now we do step 3:

summary(m_PRlt5 <- lm(RE ~ 1 + PR, data=d.split[["FALSE"]]))

Call:
lm(formula = RE ~ 1 + PR, data = d.split[["FALSE"]])

Residuals:
     Min       1Q   Median       3Q      Max
-0.53738 -0.16137 -0.07596  0.26915  0.67591

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.35638    0.18091   35.14  < 2e-16 ***
PR          -0.72961    0.06072  -12.02 4.94e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3332 on 18 degrees of freedom
Multiple R-squared:  0.8892,    Adjusted R-squared:  0.883
F-statistic: 144.4 on 1 and 18 DF,  p-value: 4.938e-10
summary(m_PRgt5 <- lm(RE ~ 1 + PR, data=d.split[["TRUE"]]))

Call:
lm(formula = RE ~ 1 + PR, data = d.split[["TRUE"]])

Residuals:
     Min       1Q   Median       3Q      Max
-1.20835 -0.31651  0.05408  0.31631  1.04533

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.05120    0.64409  -10.95 2.18e-09 ***
PR           1.81954    0.09129   19.93 1.02e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5381 on 18 degrees of freedom
Multiple R-squared:  0.9567,    Adjusted R-squared:  0.9542
F-statistic: 397.2 on 1 and 18 DF,  p-value: 1.023e-13

Very nice R2s! And, step 4:

(sum_of_devs <- deviance(m_PRlt5) + deviance(m_PRgt5))
[1] 7.211284

This means the two models account for ≈95.3% of the overall null deviance:

(152.3095 - sum_of_devs) / 152.3095
[1] 0.9526537

Also, we can check whether the slopes of PR are significantly different from each other by checking that the confidence intervals

  • don’t overlap with each other;
  • don’t include 0:
Confint(m_PRlt5)
              Estimate     2.5 %     97.5 %
(Intercept)  6.3563787  5.976305  6.7364527
PR          -0.7296054 -0.857170 -0.6020408
Confint(m_PRgt5)
             Estimate     2.5 %    97.5 %
(Intercept) -7.051199 -8.404375 -5.698022
PR           1.819535  1.627737  2.011333

Clearly, the slopes are both significantly different from each other and from 0.
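
These two checks can also be done programmatically; here is a minimal sketch using the (rounded) interval boundaries from the Confint outputs above (the variable names are ad hoc):

```r
# CI boundaries of the PR slopes (rounded, from the Confint output):
ci_lt5 <- c(-0.857, -0.602) # m_PRlt5
ci_gt5 <- c( 1.628,  2.011) # m_PRgt5
# do the two CIs not overlap?
(no_overlap <- ci_lt5[2] < ci_gt5[1] || ci_gt5[2] < ci_lt5[1]) # TRUE
# does neither CI include 0?
(exclude_0 <- !(ci_lt5[1] < 0 & 0 < ci_lt5[2]) &&
              !(ci_gt5[1] < 0 & 0 < ci_gt5[2]))                # TRUE
```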

The better way, the one that also gets us actual p-values, would of course involve multcomp::glht. The following checks whether each slope is significantly different from the other:

summary(glht( # show the summary of a glht, namely
   m_PRlt5,   # check whether the sum of (these coeffs of m_PRlt5
   matrix(c(0,1), nrow=1), # multiplied by this)
   rhs=1.82)) # is significantly different from 1.82

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = RE ~ 1 + PR, data = d.split[["FALSE"]])

Linear Hypotheses:
          Estimate Std. Error t value Pr(>|t|)
1 == 1.82 -0.72961    0.06072  -41.99   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
summary(glht( # show the summary of a glht, namely
   m_PRgt5,   # check whether the sum of (these coeffs of m_PRgt5
   matrix(c(0,1), nrow=1), # multiplied by this)
   rhs=-0.73)) # is significantly different from -0.73

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = RE ~ 1 + PR, data = d.split[["TRUE"]])

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)
1 == -0.73  1.81954    0.09129   27.93 3.33e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

… and this now checks whether the slopes are significantly different from 0:

glht(m_PRlt5, matrix(c(0, 1), nrow=1)) %>% { list(
   "CI"=confint(.)$confint,
    "P"=summary(.)$test$pvalues[1]) }
$CI
    Estimate      lwr        upr
1 -0.7296054 -0.85717 -0.6020408
attr(,"conf.level")
[1] 0.95
attr(,"calpha")
[1] 2.100922

$P
[1] 4.937961e-10
glht(m_PRgt5, matrix(c(0, 1), nrow=1)) %>% { list(
   "CI"=confint(.)$confint,
    "P"=summary(.)$test$pvalues[1]) }
$CI
  Estimate      lwr      upr
1 1.819535 1.627737 2.011333
attr(,"conf.level")
[1] 0.95
attr(,"calpha")
[1] 2.100922

$P
[1] 1.022515e-13

But this is not how we should do this for two reasons. The first of these is that we seem to have a situation where we can say ‘the effect of PR is not the same everywhere’, which means we should not go with two separate models, but one, namely one that involves an interaction. Interestingly, the interaction here is kind of like a numeric variable interacting with a binary version of itself!
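
A minimal sketch with made-up data (x, y, and grp are ad-hoc toy objects) shows this equivalence: a single model in which the numeric predictor interacts with a binary split of itself recovers exactly the two separate fits:

```r
set.seed(1)
x <- 1:20                    # made-up predictor
y <- c(10 - 1*x[1:10],       # one trend in the lower half,
       -20 + 2*x[11:20]) +   # another trend in the upper half,
   rnorm(20, sd=0.1)         # plus a little noise
grp <- factor(x > 10, labels=c("lo", "hi")) # binary split of x
m_one <- lm(y ~ 1 + x*grp)               # one model w/ interaction
m_lo  <- lm(y ~ 1 + x, subset=grp=="lo") # separate model, lower half
m_hi  <- lm(y ~ 1 + x, subset=grp=="hi") # separate model, upper half
# the reference-level slope matches the lower-half fit, slope +
# interaction adjustment matches the upper-half fit:
all.equal(unname(coef(m_one)["x"]), unname(coef(m_lo)["x"]))
all.equal(unname(coef(m_one)["x"] + coef(m_one)["x:grphi"]),
          unname(coef(m_hi)["x"]))
# & the single model's deviance equals the summed deviances:
all.equal(deviance(m_one), deviance(m_lo) + deviance(m_hi))
```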

1.1.2.1 The interaction

Thus, we start by defining a variable PR_bin and add it to the data frame d:

if(!any(d$PR==5)) { breakpoint <- 5 } # safe: no observed PR value is exactly 5
d$PR_bin <- factor(
   ifelse(d$PR>breakpoint, "gtbp", "ltbp"),
   levels=c("ltbp", "gtbp"))
summary(d <- d[,c(1,3,2)])
       PR         PR_bin         RE
 Min.   :0.940   ltbp:20   Min.   :1.880
 1st Qu.:2.555   gtbp:20   1st Qu.:3.615
 Median :4.965             Median :4.725
 Mean   :4.823             Mean   :4.968
 3rd Qu.:6.690             3rd Qu.:5.600
 Max.   :9.310             Max.   :9.430  

Then we fit a new model that includes the ‘old’ numeric predictor but now also its binary version, as well as the interaction of the two, and plot its diagnostics:

summary(m_breakp <- lm(     # fit and
   RE ~ 1 +                 # summarize the
   PR + PR_bin + PR:PR_bin, # model w/ the
   data=d))                 # new predictors

Call:
lm(formula = RE ~ 1 + PR + PR_bin + PR:PR_bin, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2084 -0.1990 -0.0547  0.2820  1.0453

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     6.35638    0.24298  26.160  < 2e-16 ***
PR             -0.72961    0.08155  -8.947 1.11e-10 ***
PR_bingtbp    -13.40758    0.58822 -22.794  < 2e-16 ***
PR:PR_bingtbp   2.54914    0.11143  22.878  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4476 on 36 degrees of freedom
Multiple R-squared:  0.9527,    Adjusted R-squared:  0.9487
F-statistic: 241.5 on 3 and 36 DF,  p-value: < 2.2e-16
deviance(m_breakp) # same as sum_of_devs
[1] 7.211284
par(mfrow=c(1,2))         # show the
plot(m_breakp, which=1:2) # diagnostic
par(mfrow=c(1,1))         # plots
Figure 3: Model-diagnostic plots for m_breakp w/ a breakpoint of 5

See how well this model does? (Look at both the range of x-axis values of the first diagnostic plot and the relative lack of structure in it!) It has the same deviance as what we computed before for sum_of_devs from the deviances of m_PRlt5 and m_PRgt5, a very high multiple R2, and the interaction is significant, which shows that

  • the slope of PR in m_PRlt5 (-0.7296054) and
  • the slope of PR in m_PRgt5 (1.8195353)

are indeed significantly different. Plus, obviously, the diagnostics of this single model now look much better than they did for m_straight, which becomes clearest if we compare the fitted-vs-residuals plots of both models in one panel as in Figure 4:

plot(pch=16, col="#00000030",
   xlab="Fitted values", xlim=c( 3,   7)  , x=predict(m_straight),
   ylab="Residuals"    , ylim=c(-3.5, 3.5), y=residuals(m_straight)); grid()
points(x=predict(m_breakp), y=residuals(m_breakp), pch=16)
text(6  ,  2.5, "m_straight", col="#00000030")
text(6.5, -0.5, "m_breakp")
Figure 4: Model-diagnostic plots for m_straight vs. m_breakp

How might we best plot this? The usual best strategy – going with effects – isn’t too great here:

plot(effects::allEffects(m_breakp, grid=TRUE))

We want something better. For that, we go back to our data frame nd that so far only contains a sequence of values covering the predictor PR and the predictions of m_straight and

  1. add a new column that states for each value in that sequence which value of PR_bin it would come with;
  2. add new columns with the predictions of our model m_breakp for these pairs of values and their lower and upper 95% CI:
# step 1:
nd$PR_bin <- c("ltbp", "gtbp")[as.numeric(nd$PR>=breakpoint)+1]; nd <- nd[,c(1,5,2:4)]
# step 2:
nd <- cbind(nd, predict(m_breakp, newdata=nd, interval="confidence"))
colnames(nd)[6:8] <- c("fit.brk", "lwr.brk", "upr.brk")
nd[c(1:2, 99:102, 199:200),]
      PR PR_bin  fit.str  lwr.str  upr.str   fit.brk   lwr.brk   upr.brk
1   0.00   ltbp 3.155040 1.914563 4.395517  6.356379  5.863594  6.849163
2   0.05   ltbp 3.173833 1.943545 4.404120  6.319898  5.834637  6.805159
99  4.90   ltbp 4.996690 4.432278 5.561102  2.781312  2.366830  3.195794
100 4.95   ltbp 5.015483 4.450597 5.580368  2.744832  2.323120  3.166544
101 5.00   gtbp 5.034275 4.468684 5.599866  2.046477  1.686457  2.406498
102 5.05   gtbp 5.053067 4.486540 5.619595  2.137454  1.783766  2.491142
199 9.90   gtbp 6.875925 5.583359 8.168490 10.962200 10.461978 11.462422
200 9.95   gtbp 6.894717 5.591837 8.197597 11.053177 10.545908 11.560445

Then we plot this into Figure 5; the result looks much better than the effects plot:

# observed data
plot(pch=16, col=ifelse(d$PR_bin=="ltbp", "#FF00FFFF", "#00FF00FF"),
   x=d$PR, y=d$RE); grid(); abline(v=5, lty=3)
# predicted values
points(x=nd$PR, y=nd$fit.brk, pch=16, cex=0.5,
      col=ifelse(nd$PR_bin=="ltbp", "#FF00FFFF", "#00FF00FF"))
with(nd[nd$PR_bin=="ltbp",],
   polygon(border=NA, col="#FF00FF20",
      c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))
with(nd[nd$PR_bin=="gtbp",],
   polygon(border=NA, col="#00FF0020",
      c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))
text(5, 7, "breakpoint", srt=90, pos=2)
Figure 5: Observations against predictions for m_breakp w/ a breakpoint of 5

1.1.2.2 The breakpoint

But this addressed ‘only’ the first problem – the second is, how do we even know the best breakpoint is indeed 5? We just went with a visual heuristic but didn’t really explore this properly, which is what we will do now. For this, we could do the following:

  1. we determine all possible breakpoints at which d$PR could be split into two values;
  2. for each possible breakpoint, we fit a version of m_breakp and collect its deviance;
  3. the best breakpoint is the one whose model has the smallest deviance; that model is the ‘final model’ we would then adopt.

Here’s step 1: We determine all unique values of d$PR and the possible breakpoints between them:

unique_PR_values <- d$PR %>% unique %>% sort # sort unique values
possible_breakpoints <- setNames( # make possible_breakpoints
   # the middles between each value & the next
   (unique_PR_values[-1] + unique_PR_values[-length(unique_PR_values)]) / 2,
   # & assign as names the lower and the upper values of each pair:
   paste(head(unique_PR_values, -1), tail(unique_PR_values, -1), sep="-"))
possible_breakpoints
0.94-0.98 0.98-1.41 1.41-1.52 1.52-1.86 1.86-1.92 1.92-2.14 2.14-2.39 2.39-2.61
    0.960     1.195     1.465     1.690     1.890     2.030     2.265     2.500
2.61-3.05 3.05-3.32 3.32-3.39 3.39-3.65 3.65-3.77 3.77-4.16 4.16-4.38 4.38-4.52
    2.830     3.185     3.355     3.520     3.710     3.965     4.270     4.450
4.52-4.92 4.92-5.01 5.01-5.09 5.09-5.51 5.51-5.52 5.52-5.56 5.56-5.71 5.71-6.28
    4.720     4.965     5.050     5.300     5.515     5.540     5.635     5.995
6.28-6.32 6.32-6.66 6.66-6.78 6.78-7.02 7.02-7.26 7.26-7.77 7.77-8.22 8.22-8.39
    6.300     6.490     6.720     6.900     7.140     7.515     7.995     8.305
8.39-8.71 8.71-8.79 8.79-9.31
    8.550     8.750     9.050 

We set up a collector vector for the deviances:

possible_deviances <- setNames(
   rep(NA, length(possible_breakpoints)),
   possible_breakpoints)

Then step 2:

for (i in seq(possible_breakpoints)) { # loop over the breakpoints
   d$PR_bin <- factor(ifelse( # create a binary predictor at breakpoint
      d$PR>possible_breakpoints[i], "gtbp", "ltbp")) # w/ these levels
   # collect the deviance when that predictor is used in the model:
   possible_deviances[i] <- lm(RE ~ 1 + PR*PR_bin, data=d) %>% deviance
}

Step 3: what’s the best breakpoint? The plot shows that our guesstimate of 5 really wasn’t all that bad, yet still not optimal:

bestbreakpoint <- possible_breakpoints[which(possible_deviances==min(possible_deviances))]
plot(type="l",
   xlim=c(0,  10), x=possible_breakpoints,
   ylim=c(0, 150), y=possible_deviances); grid()
arrows( bestbreakpoint, 100, bestbreakpoint, 10)
   text(bestbreakpoint, 100, bestbreakpoint, pos=3)
arrows(5, 50, 5, 10)
   text(5, 50, 5, pos=3)
Figure 6: Deviances against breakpoint selections

So we fit our model again, but now with this best breakpoint of 5.995, and plot its diagnostics again, too:

d$PR_bin <- factor(
   ifelse(d$PR>bestbreakpoint, ">bp", "<bp"))
summary(m_bestbreakp <- lm( # fit and
   RE ~ 1 +                 # summarize the
   PR + PR_bin + PR:PR_bin, # model w/ the
   data=d))                 # new predictors

Call:
lm(formula = RE ~ 1 + PR + PR_bin + PR:PR_bin, data = d)

Residuals:
     Min       1Q   Median       3Q      Max
-0.85011 -0.14936 -0.09176  0.26467  1.27806

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.28409    0.17690   35.52   <2e-16 ***
PR            -0.69823    0.04802  -14.54   <2e-16 ***
PR_bin>bp    -12.11922    0.79722  -15.20   <2e-16 ***
PR:PR_bin>bp   2.36978    0.11234   21.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3835 on 36 degrees of freedom
Multiple R-squared:  0.9652,    Adjusted R-squared:  0.9623
F-statistic: 333.2 on 3 and 36 DF,  p-value: < 2.2e-16
deviance(m_bestbreakp)
[1] 5.294669
par(mfrow=c(1,2))         # show the
plot(m_bestbreakp, which=1:2) # diagnostic
par(mfrow=c(1,1))         # plots
Figure 7: Model-diagnostic plots for m_bestbreakp w/ the best breakpoint of 5.995

The R2-value of this model with the best breakpoint is a bit higher than the one with our initial heuristic breakpoint, and of course this model is significantly better than the straight-line model:

anova(m_straight, m_breakp    , test="F")
Analysis of Variance Table

Model 1: RE ~ 1 + PR
Model 2: RE ~ 1 + PR + PR_bin + PR:PR_bin
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     38 118.038
2     36   7.211  2    110.83 276.63 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m_straight, m_bestbreakp, test="F")
Analysis of Variance Table

Model 1: RE ~ 1 + PR
Model 2: RE ~ 1 + PR + PR_bin + PR:PR_bin
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     38 118.038
2     36   5.295  2    112.74 383.29 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c("AIC of m_straight"=AIC(m_straight), "AIC of m_breakp"=AIC(m_breakp), "AIC of m_bestbreakp"=AIC(m_bestbreakp))
  AIC of m_straight     AIC of m_breakp AIC of m_bestbreakp
          162.80012            54.98579            42.62793 

So let’s generate a new plot, for which we first compute the new predictions …

# step 1: (again)
nd$PR_bin <- c("<bp", ">bp")[as.numeric(nd$PR>=bestbreakpoint)+1]
# step 2 (again):
nd[6:8] <- predict(m_bestbreakp, newdata=nd, interval="confidence")
nd[c(1:2, 99:102, 199:200),]
      PR PR_bin  fit.str  lwr.str  upr.str   fit.brk   lwr.brk   upr.brk
1   0.00    <bp 3.155040 1.914563 4.395517  6.284089  5.925314  6.642863
2   0.05    <bp 3.173833 1.943545 4.404120  6.249177  5.894804  6.603551
99  4.90    <bp 4.996690 4.432278 5.561102  2.862775  2.647124  3.078426
100 4.95    <bp 5.015483 4.450597 5.580368  2.827864  2.608744  3.046984
101 5.00    <bp 5.034275 4.468684 5.599866  2.792952  2.570311  3.015594
102 5.05    <bp 5.053067 4.486540 5.619595  2.758041  2.531828  2.984254
199 9.90    >bp 6.875925 5.583359 8.168490 10.713196 10.193426 11.232965
200 9.95    >bp 6.894717 5.591837 8.197597 10.796773 10.267549 11.325998

Then we plot this in the same way as before. (Since we will add more info to this plot multiple times later, I put the plotting code into a function to save ‘code space’ below.)

plot1 <- function () {
   # observed data
   plot(pch=16, col=ifelse(d$PR_bin=="<bp", "#FF00FFFF", "#00FF00FF"),
      x=d$PR, y=d$RE); grid(); abline(v=bestbreakpoint, lty=3)
   abline(m_straight, lty=2) # the straight-line regression model
   # predicted values
   points(x=nd$PR, y=nd$fit.brk, pch=16, cex=0.5,
         col=ifelse(nd$PR_bin=="<bp", "#FF00FFFF", "#00FF00FF"))
   with(nd[nd$PR_bin=="<bp",],
      polygon(border=NA, col="#FF00FF20",
         c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))
   with(nd[nd$PR_bin==">bp",],
      polygon(border=NA, col="#00FF0020",
         c(PR, rev(PR)), c(lwr.brk, rev(upr.brk))))
}; plot1()
Figure 8: Observations against predictions for m_bestbreakp (w/ the best breakpoint)

Not a huge difference from the breakpoint model where we just assumed the breakpoint was 5 – probably none when it comes to interpretation, in fact – but this is ‘objectively’ the best model (with a breakpoint) we could get for this data.
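
The whole search can also be wrapped into a small reusable function; this is just a sketch (find_breakpoint is not a function from any package, it merely re-packages steps 1 to 3 from above):

```r
find_breakpoint <- function (x, y) {
   xs <- sort(unique(x))                 # sorted unique predictor values
   cands <- (xs[-1] + xs[-length(xs)])/2 # midpoints as candidate breakpoints
   devs <- vapply(cands, function (bp) { # for each candidate breakpoint,
      x_bin <- factor(x > bp)            # make the binary predictor &
      deviance(lm(y ~ 1 + x*x_bin))      # collect the model's deviance
   }, numeric(1))
   cands[which.min(devs)]                # candidate w/ the lowest deviance
}
```

Applied to the data above, find_breakpoint(d$PR, d$RE) should return the same value as bestbreakpoint (5.995), since deviances are unaffected by how the binary factor is coded.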

1.2 Polynomials

1.2.1 A first example

One possible alternative would be to not posit a clear breakpoint that separates two straight lines but to assume one line with some amount of curvature. Maybe the simplest way to do this would be a polynomial-regression kind of approach, in which the numeric predictor is entered into the rhs as a polynomial of the n-th degree, with n-1 being, in a sense, the number of curves/turns the line can make.

The straight-line model we fit above is this:

summary(m_straight)

Call:
lm(formula = RE ~ 1 + PR, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1881 -1.1411 -0.0121  1.5548  2.7758

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.1550     0.6128   5.149 8.33e-06 ***
PR            0.3758     0.1132   3.322  0.00199 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.762 on 38 degrees of freedom
Multiple R-squared:  0.225, Adjusted R-squared:  0.2046
F-statistic: 11.03 on 1 and 38 DF,  p-value: 0.001986
deviance(m_straight)
[1] 118.0379

A model with a polynomial allowing for one curve could be written like this:

summary(m_polyorth <- lm(
   RE ~ 1 + poly(PR, 2), # <---
   data=d))

Call:
lm(formula = RE ~ 1 + poly(PR, 2), data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-1.6437 -0.3065  0.1777  0.4188  1.1970

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.9678     0.1106   44.92  < 2e-16 ***
poly(PR, 2)1   5.8542     0.6994    8.37 4.62e-10 ***
poly(PR, 2)2   9.9969     0.6994   14.29  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6994 on 37 degrees of freedom
Multiple R-squared:  0.8812,    Adjusted R-squared:  0.8747
F-statistic: 137.2 on 2 and 37 DF,  p-value: < 2.2e-16
deviance(m_polyorth)
[1] 18.09907

Certainly a massive improvement over the straight-line model! That means the curvature pretty much has to be significant, and it is; note that the F-value of the anova comparison is the square of the t-value for poly(PR, 2)2 in the summary output.

anova(m_straight, m_polyorth, test="F")
Analysis of Variance Table

Model 1: RE ~ 1 + PR
Model 2: RE ~ 1 + poly(PR, 2)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     38 118.038
2     37  18.099  1    99.939 204.31 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c("AIC of m_straight"=AIC(m_straight), "AIC of m_polyorth"=AIC(m_polyorth))
AIC of m_straight AIC of m_polyorth
        162.80012          89.79432 

What do the diagnostics say? They are not as good as those of m_bestbreakp, but certainly much better than the straight-line version m_straight:

par(mfrow=c(1,2))           # show the
plot(m_polyorth, which=1:2) # diagnostic
par(mfrow=c(1,1))           # plots
Figure 9: Model-diagnostic plots for m_polyorth

What does the fit look like? Figure 10 indicates it’s quite good though possibly not as good as the model with a breakpoint:

nd <- cbind(nd, predict(m_polyorth, newdata=nd, interval="confidence"))
colnames(nd)[9:11] <- c("fit.poly", "lwr.poly", "upr.poly")
plot2 <- function () {
   plot1()
   lines(x=nd$PR, y=nd$fit.poly, col="red", lty=2)
   polygon(col="#FF000020", border=NA,
      c(nd$PR, rev(nd$PR)),
      c(nd$lwr.poly, rev(nd$upr.poly)))
}; plot2()
Figure 10: Observations against predictions (m_straight & m_poly)

How could we check whether the model with the best breakpoint m_bestbreakp is indeed better than the polynomial m_polyorth? Since the two models are not nested, we can’t use anova, but we can compare their AICs:

c("AIC of m_bestbreakp"=AIC(m_bestbreakp), "AIC of m_polyorth"=AIC(m_polyorth))
AIC of m_bestbreakp   AIC of m_polyorth
           42.62793            89.79432 

AIC shows that the model with the best breakpoint is much better than the polynomial!

But (i) how do we interpret the coefficients in the summary output of m_polyorth? And (ii) how does the polynomial do what it does? The short answer to (i) is: you don’t. The coefficients of the (default) kind of polynomial we created here are pretty much uninterpretable, and interpretation of the results is based on the visualization of the predictions, as we have done before. This is because of the answer to (ii). Here are the model matrices for both models; the one for m_straight is fine, but the one for m_polyorth is not intuitive at all.

model.matrix(m_straight) %>% head
  (Intercept)   PR
1           1 0.98
2           1 0.98
3           1 1.52
4           1 2.14
5           1 2.61
6           1 3.05
model.matrix(m_polyorth) %>% head
  (Intercept) poly(PR, 2)1 poly(PR, 2)2
1           1   -0.2467255   0.26432964
2           1   -0.2467255   0.26432964
3           1   -0.2120568   0.15319737
4           1   -0.1722520   0.04578915
5           1   -0.1420774  -0.02125283
6           1   -0.1138289  -0.07277623

One reason why it’s unintuitive is that the default polynomials in R are orthogonal polynomials, with orthogonal meaning they are transformed such that they end up not being linearly correlated with each other (which should remind you of collinearity):

cor(
   model.matrix(m_polyorth)[,2],
   model.matrix(m_polyorth)[,3]) %>%
round(4)
[1] 0

But to appreciate the logic of polynomials a bit more, we can look at a non-orthogonalized version of these polynomials, which is a bit easier to comprehend; we define polynomials again but now use the additional argument raw=TRUE:

summary(m_polyraw <- lm(
   RE ~ 1 + poly(PR, 2, raw=TRUE), # <---
   data=d))

Call:
lm(formula = RE ~ 1 + poly(PR, 2, raw = TRUE), data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-1.6437 -0.3065  0.1777  0.4188  1.1970

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               8.23176    0.43044   19.12  < 2e-16 ***
poly(PR, 2, raw = TRUE)1 -2.38307    0.19817  -12.03  2.4e-14 ***
poly(PR, 2, raw = TRUE)2  0.28062    0.01963   14.29  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6994 on 37 degrees of freedom
Multiple R-squared:  0.8812,    Adjusted R-squared:  0.8747
F-statistic: 137.2 on 2 and 37 DF,  p-value: < 2.2e-16
deviance(m_polyraw)
[1] 18.09907

You can see that everything stays the same – R2, F, residuals, … – so in that sense the model m_polyraw is equivalent to the model m_polyorth, but now check out the model matrix:

model.matrix(m_straight) %>% head
  (Intercept)   PR
1           1 0.98
2           1 0.98
3           1 1.52
4           1 2.14
5           1 2.61
6           1 3.05
model.matrix(m_polyraw)  %>% head
  (Intercept) poly(PR, 2, raw = TRUE)1 poly(PR, 2, raw = TRUE)2
1           1                     0.98                   0.9604
2           1                     0.98                   0.9604
3           1                     1.52                   2.3104
4           1                     2.14                   4.5796
5           1                     2.61                   6.8121
6           1                     3.05                   9.3025

So the raw version of a 2nd-degree polynomial of a numeric predictor is simply

  • using one column for the original predictor (column 2 of the model matrix of m_polyraw);
  • using another column for the square of the original predictor (column 3 of the model matrix of m_polyraw); this is what makes the line curved.
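
That this is really all there is to the raw version can be verified directly (a minimal sketch; x is an arbitrary toy vector):

```r
x <- c(0.5, 1, 2, 3.5)        # an arbitrary toy vector
p <- poly(x, 2, raw=TRUE)     # its raw 2nd-degree polynomial
all.equal(unname(p[,1]), x)   # column 1 is x itself: TRUE
all.equal(unname(p[,2]), x^2) # column 2 is x squared: TRUE
```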

But since these columns are of course extremely highly correlated, …

cor(
   model.matrix(m_polyraw)[,2],
   model.matrix(m_polyraw)[,3]) %>%
round(4)
[1] 0.974

… the default way in which poly proceeds is to orthogonalize these two columns, which, as we have seen, eliminates the high correlation between the two columns in the model matrix, but is also what makes the coefficients uninterpretable.
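
That orthogonalizing changes only the coding, not the model, can be confirmed on made-up data (a sketch; x and y are toy objects):

```r
set.seed(42)
x <- runif(30, 0, 10)                # made-up predictor
y <- 2 + 0.5*x - 0.1*x^2 + rnorm(30) # made-up curved response
m_orth <- lm(y ~ 1 + poly(x, 2))           # orthogonal polynomial
m_raw  <- lm(y ~ 1 + poly(x, 2, raw=TRUE)) # raw polynomial
# the coefficients differ, but the fits are identical:
all.equal(fitted(m_orth), fitted(m_raw))     # TRUE
all.equal(deviance(m_orth), deviance(m_raw)) # TRUE
```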

One natural follow-up question now is when to use regression with breakpoints and when a polynomial approach. Sometimes you might be able to decide on the basis of fit – here we have seen that the fit of the model with a breakpoint is much better than that of the polynomial model. However, sometimes you might also decide on the basis of the conceptual logic and hypotheses that underpin the analysis (maybe a breakpoint’s location follows perfectly well from previous work, or it is easier, or harder, to motivate a single curved causal mechanism) – in other words, ‘it depends’.

1.2.2 A more fine-grained example

Let’s look at this with a simpler example, but in more detail. We take the reaction time data from the book, where the expectations were that

  • FAM would be negatively correlated with RT: the more familiar, the faster one reacts to the word;
  • FREQ would be negatively correlated with RT: the more frequent, the faster one reacts to the word.

We first log FREQ, …

summary(x <- read.delim(   # summarize x, the result of loading
   file="_input/reactiontimes.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
        CASE          RT             FREQ          FAM       IMA
 almond   : 1   Min.   :523.0   Min.   :  1.00   hi  :12   hi  :30
 ant      : 1   1st Qu.:589.4   1st Qu.:  2.00   lo  :12   lo  :23
 apple    : 1   Median :617.7   Median :  4.00   med :31   NA's:24
 apricot  : 1   Mean   :631.9   Mean   :  9.26   NA's:22
 asparagus: 1   3rd Qu.:663.6   3rd Qu.: 10.00
 avocado  : 1   Max.   :794.5   Max.   :118.00
 (Other)  :71
      MEAN
 Min.   :315.0
 1st Qu.:409.8
 Median :437.5
 Mean   :436.0
 3rd Qu.:466.2
 Max.   :553.0
 NA's   :29     
Code
x <- droplevels(x[complete.cases(x[,c("RT", "FREQ", "FAM")]),])
x$FREQ <- log2(x$FREQ)

… and then apply a backwards model selection process to the initial model RT ~ poly(FREQ, 2) * FAM to identify a final model. Here’s the initial model:

summary(m_01 <- lm(RT ~ 1 + poly(FREQ, 2) * FAM, data=x))

Call:
lm(formula = RT ~ 1 + poly(FREQ, 2) * FAM, data = x)

Residuals:
    Min      1Q  Median      3Q     Max
-61.064 -28.493  -5.551  21.298 103.418

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            598.993     15.958  37.536   <2e-16 ***
poly(FREQ, 2)1        -120.676    109.030  -1.107    0.274
poly(FREQ, 2)2          98.103     75.744   1.295    0.202
FAMlo                  103.789     41.896   2.477    0.017 *
FAMmed                  12.553     17.876   0.702    0.486
poly(FREQ, 2)1:FAMlo   529.051    438.690   1.206    0.234
poly(FREQ, 2)2:FAMlo   507.894    353.190   1.438    0.157
poly(FREQ, 2)1:FAMmed    3.293    130.856   0.025    0.980
poly(FREQ, 2)2:FAMmed -147.037    114.527  -1.284    0.206
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.69 on 46 degrees of freedom
Multiple R-squared:  0.3853,    Adjusted R-squared:  0.2783
F-statistic: 3.603 on 8 and 46 DF,  p-value: 0.002517
head(mm_01 <- model.matrix(m_01))
  (Intercept) poly(FREQ, 2)1 poly(FREQ, 2)2 FAMlo FAMmed poly(FREQ, 2)1:FAMlo
2           1     0.02551768   -0.118122338     0      1             0.000000
3           1     0.06595647   -0.106170170     0      0             0.000000
4           1    -0.11651702    0.003851096     1      0            -0.116517
5           1    -0.11651702    0.003851096     0      1             0.000000
7           1    -0.01263059   -0.110432541     0      0             0.000000
8           1     0.13872803   -0.032545275     0      1             0.000000
  poly(FREQ, 2)2:FAMlo poly(FREQ, 2)1:FAMmed poly(FREQ, 2)2:FAMmed
2          0.000000000            0.02551768          -0.118122338
3          0.000000000            0.00000000           0.000000000
4          0.003851096            0.00000000           0.000000000
5          0.000000000           -0.11651702           0.003851096
7          0.000000000            0.00000000           0.000000000
8          0.000000000            0.13872803          -0.032545275
drop1(m_01, test="F") %>% data.frame
                  Df Sum.of.Sq      RSS      AIC  F.value    Pr..F.
<none>            NA        NA 87814.86 423.6609       NA        NA
poly(FREQ, 2):FAM  4  9834.521 97649.38 421.4993 1.287903 0.2886233

Given this result of drop1, you are likely tempted to do this next:

summary(m_02a <- update(m_01, .~.
   - poly(FREQ, 2):FAM)) # remove what drop1 said is ns

Call:
lm(formula = RT ~ poly(FREQ, 2) + FAM, data = x)

Residuals:
    Min      1Q  Median      3Q     Max
-71.847 -31.963  -4.584  23.051 107.730

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      597.96      14.09  42.449  < 2e-16 ***
poly(FREQ, 2)1   -93.27      49.81  -1.872  0.06701 .
poly(FREQ, 2)2    62.52      45.36   1.378  0.17430
FAMlo             59.25      20.52   2.888  0.00572 **
FAMmed            15.49      16.51   0.938  0.35251
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.19 on 50 degrees of freedom
Multiple R-squared:  0.3164,    Adjusted R-squared:  0.2617
F-statistic: 5.786 on 4 and 50 DF,  p-value: 0.0006604
anova(m_01, m_02a, test="F") %>% data.frame # same as drop1
  Res.Df      RSS Df Sum.of.Sq        F    Pr..F.
1     46 87814.86 NA        NA       NA        NA
2     50 97649.38 -4 -9834.521 1.287903 0.2886233

But note that the ns term returned by drop1 bundles 2 ‘elements’, so to speak – the interaction of FAM with the linear part of FREQ and with its squared part:

  • we went with keeping the curved effect but removing the (whole) interaction,
  • which means we lost 4 (!) columns in the model matrix:
head(mm_02a <- model.matrix(m_02a))
  (Intercept) poly(FREQ, 2)1 poly(FREQ, 2)2 FAMlo FAMmed
2           1     0.02551768   -0.118122338     0      1
3           1     0.06595647   -0.106170170     0      0
4           1    -0.11651702    0.003851096     1      0
5           1    -0.11651702    0.003851096     0      1
7           1    -0.01263059   -0.110432541     0      0
8           1     0.13872803   -0.032545275     0      1
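That the dropped interaction term really corresponds to 4 model-matrix columns can also be counted directly; a minimal self-contained sketch (simulated data, with a 3-level factor standing in for FAM):

```r
set.seed(42)
# simulated stand-ins: a numeric predictor and a 3-level factor like FAM
dd <- data.frame(y=rnorm(30), x=rnorm(30),
   f=factor(rep(c("hi", "lo", "med"), 10)))
m_full <- lm(y ~ 1 + poly(x, 2) * f, data=dd)
m_red  <- lm(y ~ 1 + poly(x, 2) + f, data=dd)
# 2 poly columns times 2 non-reference factor levels = 4 interaction columns:
ncol(model.matrix(m_full)) - ncol(model.matrix(m_red))
```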

So what about removing the curvature while still keeping the interaction (w/ FREQ as a straight line) in there?

summary(m_02b <- update(m_01, .~.
   - poly(FREQ, 2)*FAM # remove the curvature
   + FREQ*FAM))        # keep the interaction FAM w/ FREQ as a straight line

Call:
lm(formula = RT ~ FREQ + FAM + FAM:FREQ, data = x)

Residuals:
   Min     1Q Median     3Q    Max
-78.81 -31.94  -8.06  28.02  96.49

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  603.206     30.235  19.950   <2e-16 ***
FREQ          -3.089      7.151  -0.432   0.6676
FAMlo         94.757     37.781   2.508   0.0155 *
FAMmed        28.359     33.469   0.847   0.4009
FREQ:FAMlo   -19.599     14.137  -1.386   0.1719
FREQ:FAMmed   -4.594      8.787  -0.523   0.6035
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.61 on 49 degrees of freedom
Multiple R-squared:  0.3173,    Adjusted R-squared:  0.2476
F-statistic: 4.554 on 5 and 49 DF,  p-value: 0.001728
head(mm_02b <- model.matrix(m_02b))
  (Intercept)     FREQ FAMlo FAMmed FREQ:FAMlo FREQ:FAMmed
2           1 2.807355     0      1          0    2.807355
3           1 3.321928     0      0          0    0.000000
4           1 1.000000     1      0          1    0.000000
5           1 1.000000     0      1          0    1.000000
7           1 2.321928     0      0          0    0.000000
8           1 4.247928     0      1          0    4.247928

You can see from the results and the model matrices that m_02a and m_02b are not identical. So let’s compare things:

c("m_01"=AICc(m_01), "m_02a"=AICc(m_02a), "m_0b"=AICc(m_02b))
    m_01    m_02a     m_0b
586.7441 581.3325 583.8969 
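As a reminder, AICc is just AIC with a small-sample correction; a minimal sketch of the formula with base R only (simulated data; k counts all estimated parameters, including the error variance, and the result should match what MuMIn’s AICc returns):

```r
set.seed(1)
# simulated data for a small Gaussian lm:
dd <- data.frame(x=rnorm(40)); dd$y <- 2 * dd$x + rnorm(40)
m <- lm(y ~ 1 + x, data=dd)
k <- length(coef(m)) + 1 # nr. of estimated parameters, incl. error variance
n <- nobs(m)
aicc_by_hand <- AIC(m) + 2 * k * (k + 1) / (n - k - 1)
aicc_by_hand # AIC plus the (always positive) small-sample correction
```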
anova(m_01, m_02a, test="F") %>% data.frame # rep for convenience
  Res.Df      RSS Df Sum.of.Sq        F    Pr..F.
1     46 87814.86 NA        NA       NA        NA
2     50 97649.38 -4 -9834.521 1.287903 0.2886233
anova(m_01, m_02b, test="F") %>% data.frame
  Res.Df      RSS Df Sum.of.Sq        F    Pr..F.
1     46 87814.86 NA        NA       NA        NA
2     49 97527.67 -3 -9712.808 1.695951 0.1810154

Given these results, we go with m_02a because it is ‘less significantly different’ from m_01 and has a lower AICc-value than m_01 and m_02b. We continue:

drop1(m_02a, test="F") %>% data.frame
              Df Sum.of.Sq       RSS      AIC  F.value      Pr..F.
<none>        NA        NA  97649.38 421.4993       NA          NA
poly(FREQ, 2)  2  11644.65 109294.02 423.6955 2.981239 0.059817286
FAM            2  19977.07 117626.45 427.7365 5.114489 0.009530436

Given this result of drop1, you are likely tempted to do this next:

summary(m_03 <- update(m_02a, .~.
   - poly(FREQ, 2))) # remove what drop1 said is ns

Call:
lm(formula = RT ~ FAM, data = x)

Residuals:
    Min      1Q  Median      3Q     Max
-84.370 -24.814  -3.342  22.482 131.164

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   591.39      13.23  44.685  < 2e-16 ***
FAMlo          71.90      18.72   3.842 0.000334 ***
FAMmed         22.26      15.59   1.428 0.159250
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.85 on 52 degrees of freedom
Multiple R-squared:  0.2349,    Adjusted R-squared:  0.2055
F-statistic: 7.982 on 2 and 52 DF,  p-value: 0.0009481
anova(m_02a, m_03, test="F") %>% data.frame # same as drop1
  Res.Df       RSS Df Sum.of.Sq        F     Pr..F.
1     50  97649.38 NA        NA       NA         NA
2     52 109294.02 -2 -11644.65 2.981239 0.05981729

But note that the ns term returned by drop1 contains 2 ‘elements’, so to speak:

  • we dropped a linear component of frequency (column 2 of mm_02a), and
  • we dropped a squared component of frequency (column 3 of mm_02a),
  • which means we lost 2 (!) columns in the model matrix:
head(mm_03 <- model.matrix(m_03))
  (Intercept) FAMlo FAMmed
2           1     0      1
3           1     0      0
4           1     1      0
5           1     0      1
7           1     0      0
8           1     0      1

Let’s test each. Here are the models and their model matrices …

summary(m_03a <- update(m_02a, .~.
   - poly(FREQ, 2) # remove what drop1 said is ns
   + FREQ))        # keep FREQ as a straight line

Call:
lm(formula = RT ~ FAM + FREQ, data = x)

Residuals:
    Min      1Q  Median      3Q     Max
-76.072 -28.680  -1.306  24.454 119.161

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  621.433     19.792  31.399   <2e-16 ***
FAMlo         53.858     20.317   2.651   0.0107 *
FAMmed        10.531     16.254   0.648   0.5199
FREQ          -7.854      3.930  -1.998   0.0510 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.58 on 51 degrees of freedom
Multiple R-squared:  0.2904,    Adjusted R-squared:  0.2487
F-statistic: 6.959 on 3 and 51 DF,  p-value: 0.0005159
head(mm_03a <- model.matrix(m_03a))
  (Intercept) FAMlo FAMmed     FREQ
2           1     0      1 2.807355
3           1     0      0 3.321928
4           1     1      0 1.000000
5           1     0      1 1.000000
7           1     0      0 2.321928
8           1     0      1 4.247928
summary(m_03b <- update(m_02a, .~.
   - poly(FREQ, 2) # remove what drop1 said is ns
   + I(FREQ^2)))   # keep FREQ squared

Call:
lm(formula = RT ~ FAM + I(FREQ^2), data = x)

Residuals:
    Min      1Q  Median      3Q     Max
-81.447 -27.900  -0.794  23.477 128.042

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 607.5629    18.1228  33.525  < 2e-16 ***
FAMlo        58.8473    21.1439   2.783  0.00753 **
FAMmed       13.2331    16.9777   0.779  0.43932
I(FREQ^2)    -0.9047     0.6976  -1.297  0.20050
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.55 on 51 degrees of freedom
Multiple R-squared:  0.2593,    Adjusted R-squared:  0.2157
F-statistic: 5.952 on 3 and 51 DF,  p-value: 0.001467
head(mm_03b <- model.matrix(m_03b))
  (Intercept) FAMlo FAMmed I(FREQ^2)
2           1     0      1  7.881242
3           1     0      0 11.035206
4           1     1      0  1.000000
5           1     0      1  1.000000
7           1     0      0  5.391350
8           1     0      1 18.044888

… and then we compare things again:

c("m_02a"=AICc(m_02a), "m_03a"=AICc(m_03a), "m_03b"=AICc(m_03b))
   m_02a    m_03a    m_03b
581.3325 580.8575 583.2186 
anova(m_02a, m_03a, test="F") %>% data.frame
  Res.Df       RSS Df Sum.of.Sq        F    Pr..F.
1     50  97649.38 NA        NA       NA        NA
2     51 101358.58 -1 -3709.203 1.899245 0.1742991
anova(m_02a, m_03b, test="F") %>% data.frame
  Res.Df       RSS Df Sum.of.Sq        F     Pr..F.
1     50  97649.38 NA        NA       NA         NA
2     51 105804.64 -1  -8155.26 4.175787 0.04629252

We go with m_03a because it is not significantly different from m_02a and has a lower AICc-value than the other models:

drop1(m_03a, test="F") %>% data.frame
       Df Sum.of.Sq      RSS      AIC  F.value     Pr..F.
<none> NA        NA 101358.6 421.5498       NA         NA
FAM     2 18303.144 119661.7 426.6800 4.604743 0.01450722
FREQ    1  7935.443 109294.0 423.6955 3.992830 0.05103968

And now we’re done – why? We said we expected that “FREQ would be negatively correlated with RT”, which it is; this directional hypothesis means we can use a one-tailed p-value: instead of 0.0510397 we get 0.0255198, which is significant.
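This halving can be verified directly from the t-value and residual degrees of freedom that the summary of m_03a reports (a minimal sketch with those two values hard-coded):

```r
t_val  <- -1.998 # t-value of FREQ in m_03a (from the summary)
df_res <- 51     # residual degrees of freedom of m_03a
(p_two <- 2 * pt(abs(t_val), df_res, lower.tail=FALSE)) # two-tailed, as in summary()
(p_one <- pt(t_val, df_res)) # one-tailed (negative direction): half of p_two
```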

summary(m_03a)

Call:
lm(formula = RT ~ FAM + FREQ, data = x)

Residuals:
    Min      1Q  Median      3Q     Max
-76.072 -28.680  -1.306  24.454 119.161

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  621.433     19.792  31.399   <2e-16 ***
FAMlo         53.858     20.317   2.651   0.0107 *
FAMmed        10.531     16.254   0.648   0.5199
FREQ          -7.854      3.930  -1.998   0.0510 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.58 on 51 degrees of freedom
Multiple R-squared:  0.2904,    Adjusted R-squared:  0.2487
F-statistic: 6.959 on 3 and 51 DF,  p-value: 0.0005159

Bottom line: you may not always have to go to the lengths shown here, but be aware of what you’re dropping/doing! To see just how complex things can get, consider this example:

summary(model_eg <- lm(RT ~ 1 + poly(MEAN, 2, raw=TRUE) * poly(FREQ, 2, raw=TRUE), data=x))$coefficients
                                                           Estimate
(Intercept)                                            1.142024e+03
poly(MEAN, 2, raw = TRUE)1                            -1.766457e+00
poly(MEAN, 2, raw = TRUE)2                             1.363917e-03
poly(FREQ, 2, raw = TRUE)1                            -4.341068e+02
poly(FREQ, 2, raw = TRUE)2                             4.225957e+01
poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)1  1.815728e+00
poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)1 -1.916966e-03
poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)2 -1.927481e-01
poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)2  2.193717e-04
                                                        Std. Error    t value
(Intercept)                                           1.594448e+03  0.7162502
poly(MEAN, 2, raw = TRUE)1                            7.459812e+00 -0.2367964
poly(MEAN, 2, raw = TRUE)2                            8.750348e-03  0.1558700
poly(FREQ, 2, raw = TRUE)1                            1.358256e+03 -0.3196060
poly(FREQ, 2, raw = TRUE)2                            2.173183e+02  0.1944593
poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)1 6.147412e+00  0.2953646
poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)1 6.974018e-03 -0.2748725
poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)2 9.480150e-01 -0.2033176
poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)2 1.039401e-03  0.2110558
                                                       Pr(>|t|)
(Intercept)                                           0.4781041
poly(MEAN, 2, raw = TRUE)1                            0.8140541
poly(MEAN, 2, raw = TRUE)2                            0.8769395
poly(FREQ, 2, raw = TRUE)1                            0.7509729
poly(FREQ, 2, raw = TRUE)2                            0.8468257
poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)1 0.7692822
poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)1 0.7848652
poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)2 0.8399439
poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)2 0.8339427
head(model.matrix(model_eg))
  (Intercept) poly(MEAN, 2, raw = TRUE)1 poly(MEAN, 2, raw = TRUE)2
2           1                        415                     172225
3           1                        451                     203401
5           1                        442                     195364
7           1                        449                     201601
8           1                        412                     169744
9           1                        412                     169744
  poly(FREQ, 2, raw = TRUE)1 poly(FREQ, 2, raw = TRUE)2
2                   2.807355                   7.881242
3                   3.321928                  11.035206
5                   1.000000                   1.000000
7                   2.321928                   5.391350
8                   4.247928                  18.044888
9                   2.000000                   4.000000
  poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)1
2                                              1165.052
3                                              1498.190
5                                               442.000
7                                              1042.546
8                                              1750.146
9                                               824.000
  poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)1
2                                              483496.7
3                                              675683.5
5                                              195364.0
7                                              468103.0
8                                              721060.2
9                                              339488.0
  poly(MEAN, 2, raw = TRUE)1:poly(FREQ, 2, raw = TRUE)2
2                                              3270.715
3                                              4976.878
5                                               442.000
7                                              2420.716
8                                              7434.494
9                                              1648.000
  poly(MEAN, 2, raw = TRUE)2:poly(FREQ, 2, raw = TRUE)2
2                                               1357347
3                                               2244572
5                                                195364
7                                               1086902
8                                               3063011
9                                                678976

Here’s what the columns are:

  • column 1: intercept
  • column 2: MEAN
  • column 3: MEAN^2
  • column 4: FREQ
  • column 5: FREQ^2
  • column 6: MEAN * FREQ
  • column 7: MEAN^2 * FREQ
  • column 8: MEAN * FREQ^2
  • column 9: MEAN^2 * FREQ^2
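That these columns really are just the products of the raw predictors is easy to verify in a small self-contained sketch (simulated vectors standing in for MEAN and FREQ):

```r
set.seed(7)
x1 <- rnorm(6); x2 <- rnorm(6) # simulated stand-ins for MEAN and FREQ
mm <- model.matrix(~ 1 + poly(x1, 2, raw=TRUE) * poly(x2, 2, raw=TRUE))
all.equal(unname(mm[, 6]), x1 * x2)     # column 6: 'MEAN * FREQ'
all.equal(unname(mm[, 9]), x1^2 * x2^2) # column 9: 'MEAN^2 * FREQ^2'
```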

1.3 Splines and/or GAMs

There are other, more sophisticated ways to model curvature. Polynomials are a good first shot and I nearly always try them first but they are sometimes not sensitive enough. One alternative is what are called splines (usually “natural cubic splines”), and in the simplest case they can be fit fairly easily like this:

library(splines)
summary(m_spline_2 <- lm(
   RE ~ 1 + ns(PR, knots=2),
   data=d))

Call:
lm(formula = RE ~ 1 + ns(PR, knots = 2), data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8326 -0.2414  0.1850  0.5269  1.4625

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          6.5026     0.3264  19.921  < 2e-16 ***
ns(PR, knots = 2)1  -5.2920     0.7510  -7.046 2.46e-08 ***
ns(PR, knots = 2)2   5.2024     0.4106  12.670 5.01e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7686 on 37 degrees of freedom
Multiple R-squared:  0.8565,    Adjusted R-squared:  0.8487
F-statistic: 110.4 on 2 and 37 DF,  p-value: 2.524e-16

While the coefficients are again uninterpretable, the fit seems very similar in quality to that of the polynomial model; the polynomial is in fact a bit better, as the AIC-values confirm:

AIC(m_spline_2)
[1] 97.34124
AIC(m_polyorth)
[1] 89.79432

Let’s compare the predictions of both; in Figure 11, we add the new result to all previous ones:

nd <- cbind(nd, predict(m_spline_2, newdata=nd, interval="confidence"))
colnames(nd)[12:14] <- c("fit.spli", "lwr.spli", "upr.spli")
plot3 <- function () {
   plot2()
   lines(nd$PR, nd$fit.spli, col="blue")
   polygon(col="#0000FF20", border=NA,
      c(nd$PR, rev(nd$PR)),
      c(nd$lwr.spli, rev(nd$upr.spli)))
}; plot3()
Figure 11: Observations against predictions (m_straight, m_bestbreakpoint, m_poly, m_splines)

For the sake of interpreting this, I am not sure the statistical difference here between the polynomial and the splines could be important conceptually, theoretically, or in any applied sense.

The final way to look at curvature, which is the most powerful but also by far the most complex one, involves a method called generalized additive models (gams). On the surface, using this method is deceptively simple; using the package mgcv, the approach looks very much like the splines one:

library(mgcv)
summary(m_gam <- gam(
   RE ~ 1 + s(PR), data=d))

Family: gaussian
Link function: identity

Formula:
RE ~ 1 + s(PR)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.96775    0.06719   73.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df     F p-value
s(PR) 6.382  7.529 106.7  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.954   Deviance explained = 96.1%
GCV = 0.22146  Scale est. = 0.18059   n = 40

Variance explanation is excellent: R2 is greater than 0.95, though the R2 of m_breakp was a tiny bit better; AIC, on the other hand, has a slight preference for m_gam:

AIC(m_breakp)
[1] 54.98579
AIC(m_gam)
[1] 53.65669

We can also see that the smooth term – the thing responsible for the curvature – is highly significant and involves a degree of curvature equivalent to about 6.4 effective degrees of freedom (edf, a measure of the squiggliness the gam estimated from the data). But as before, these numerical results don’t tell us anything about interpretation, so we plot again, this time into Figure 12 (and without a function now because this will be the last plot):

nd <- cbind(nd, predict(m_gam, newdata=nd))
colnames(nd)[15] <- "fit.gam"
plot3(); lines(nd$PR, nd$fit.gam, col="darkgreen", lwd=3)
Figure 12: Observations against predictions (all models)

The results of the gam clearly support the regression with breakpoints – although the big turning point of the gam is again closer to PR=5 … – and I leave it up to you to verify that diagnostics such as predicted-vs.-fitted look fine. You might of course wonder why one wouldn’t always use a gam – the answer is what I alluded to above: gams are indeed extremely powerful, and in the best of worlds we’d use them much more often, but the complexity that comes with them often makes them really hard to apply. And in this particular case, the gam is not consistently better than the regression with the best breakpoint (and any interpretation of the gam’s results would probably be identical to that of the model with the best breakpoint).

1.4 Exercises

1.4.1 Exercise 1

Take the duration data, log DURATION as before, square root SURPRISAL as before, …

Code
rm(list=ls(all.names=TRUE))
pkgs2load <- c("car", "effects", "magrittr", "multcomp", "MuMIn")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
     car  effects magrittr multcomp    MuMIn
    TRUE     TRUE     TRUE     TRUE     TRUE 
Code
summary(d <- read.delim("_input/wdurations.csv", stringsAsFactors=TRUE))
      CASE           CONV         SPEAKER         SEX          DURATION
 Min.   :   1   C08    : 873   S15    : 478   female:3203   Min.   :  14.0
 1st Qu.:1596   C05    : 831   S10    : 455   male  :3180   1st Qu.: 110.0
 Median :3192   C06    : 800   S11    : 414                 Median : 162.0
 Mean   :3192   C07    : 772   S14    : 410                 Mean   : 194.5
 3rd Qu.:4788   C04    : 737   S08    : 401                 3rd Qu.: 245.0
 Max.   :6383   C09    : 557   S16    : 395                 Max.   :1405.0
                (Other):1813   (Other):3830
   POSINTURN          LENGTH            FREQ         SURPRISAL
 Min.   : 1.000   Min.   : 1.000   Min.   : 0.00   Min.   : 0.00
 1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.:12.50   1st Qu.: 4.00
 Median : 6.000   Median : 3.000   Median :15.00   Median : 6.00
 Mean   : 5.524   Mean   : 2.768   Mean   :14.04   Mean   : 6.17
 3rd Qu.: 8.000   3rd Qu.: 3.000   3rd Qu.:17.00   3rd Qu.: 8.00
 Max.   :10.000   Max.   :14.000   Max.   :18.00   Max.   :21.00

      CLASS       VOWELS
 closed  :4021   one :3698
 number  : 125   two+:2685
 open    :2086
 propname: 110
 unclear :  41

                            
Code
d$DURATION.l <- log2(d$DURATION)
d$SURPRISAL.s <- sqrt(d$SURPRISAL)

… and then determine whether the relationship between DURATION.l and SURPRISAL.s is better captured with a straight line or a polynomial to the second degree.

plot(pch=16, col="#00000020",
   x=jitter(d$SURPRISAL.s),
   y=       d$DURATION.l ); grid()
summary(m_straight <- lm(DURATION.l ~ 1 + SURPRISAL.s, data=d))

Call:
lm(formula = DURATION.l ~ 1 + SURPRISAL.s, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1210 -0.4686 -0.0012  0.4710  3.5814

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.92162    0.03118  189.91   <2e-16 ***
SURPRISAL.s  0.60160    0.01255   47.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7432 on 6381 degrees of freedom
Multiple R-squared:  0.2647,    Adjusted R-squared:  0.2646
F-statistic:  2297 on 1 and 6381 DF,  p-value: < 2.2e-16
abline(m_straight, col="red")

# show the diagnostic plots:
par(mfrow=c(1,2))
plot(m_straight, which=1:2)

par(mfrow=c(1,1))

summary(m_polyorth <- lm(
   DURATION.l ~ 1 + poly(SURPRISAL.s, 2),
   data=d))

Call:
lm(formula = DURATION.l ~ 1 + poly(SURPRISAL.s, 2), data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-3.0615 -0.4616  0.0008  0.4703  3.4330

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            7.347946   0.009166  801.68   <2e-16 ***
poly(SURPRISAL.s, 2)1 35.617082   0.732276   48.64   <2e-16 ***
poly(SURPRISAL.s, 2)2 10.165338   0.732276   13.88   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7323 on 6380 degrees of freedom
Multiple R-squared:  0.2862,    Adjusted R-squared:  0.286
F-statistic:  1279 on 2 and 6380 DF,  p-value: < 2.2e-16
m_polyraw <- lm(DURATION.l ~ 1 + poly(SURPRISAL.s, 2, raw=TRUE), data=d)

anova(m_straight, m_polyorth, test="F") # same as anova(m_straight, m_polyraw, test="F")
Analysis of Variance Table

Model 1: DURATION.l ~ 1 + SURPRISAL.s
Model 2: DURATION.l ~ 1 + poly(SURPRISAL.s, 2)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1   6381 3524.5
2   6380 3421.1  1    103.33 192.71 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m_straight)
[1] 14329.25
AIC(m_polyorth) # rel lik: 6.463901e+40
[1] 14141.31
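The relative likelihood mentioned in the comment follows directly from the two AIC-values via exp((AIC_worse - AIC_better) / 2); with the values reported above:

```r
aic_straight <- 14329.25 # AIC(m_straight) from above
aic_poly     <- 14141.31 # AIC(m_polyorth) from above
(rel_lik <- exp((aic_straight - aic_poly) / 2)) # on the order of 6.46e+40
```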
head(model.matrix(m_straight)) # recap
  (Intercept) SURPRISAL.s
1           1    2.236068
2           1    2.449490
3           1    1.732051
4           1    1.732051
5           1    2.828427
6           1    2.645751
head(model.matrix(m_polyorth)) # recap
  (Intercept) poly(SURPRISAL.s, 2)1 poly(SURPRISAL.s, 2)2
1           1          -0.002276882          -0.007936962
2           1           0.001327994          -0.007446643
3           1          -0.010790163          -0.003973468
4           1          -0.010790163          -0.003973468
5           1           0.007728570          -0.003396895
6           1           0.004643019          -0.005856955
head(model.matrix(m_polyraw))  # it undoes our transformation ;-)
  (Intercept) poly(SURPRISAL.s, 2, raw = TRUE)1
1           1                          2.236068
2           1                          2.449490
3           1                          1.732051
4           1                          1.732051
5           1                          2.828427
6           1                          2.645751
  poly(SURPRISAL.s, 2, raw = TRUE)2
1                                 5
2                                 6
3                                 3
4                                 3
5                                 8
6                                 7
nd <- data.frame(
   SURPRISAL.s=seq(
      min(d$SURPRISAL.s),
      max(d$SURPRISAL.s),
      length.out=10))
nd <- cbind(nd, predict(
   m_polyraw, # or m_polyorth
   newdata=nd,
   interval="confidence"))
plot(pch=16, col="#00000020",
   x=jitter(d$SURPRISAL.s),
   y=       d$DURATION.l)
grid(); abline(m_straight, col="red")
lines(nd$SURPRISAL.s, nd$fit, col="#00FF00FF", lwd=2)
polygon(
   c(nd$SURPRISAL.s, rev(nd$SURPRISAL.s)),
   c(nd$lwr, rev(nd$upr)),
   col="#00FF0020", border=NA)

1.4.2 Exercise 2

Hilpert (2011) discusses the change in the frequency of the keep V-ing construction in 18 years of the TIME Magazine corpus; the data are in _input/keepVing.csv:

rm(list=ls(all.names=TRUE))
pkgs2load <- c("car", "effects", "emmeans", "multcomp")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
     car  effects  emmeans multcomp
    TRUE     TRUE     TRUE     TRUE 
summary(d <- read.delim("_input/keepVing.csv", stringsAsFactors=TRUE))
      YEAR         CORPSIZE            TOKENS         TOKpmw
 Min.   :1990   Min.   :20098563   Min.   :1507   Min.   :73.02
 1st Qu.:1994   1st Qu.:20521914   1st Qu.:1786   1st Qu.:87.15
 Median :1998   Median :20626230   Median :1834   Median :89.11
 Mean   :1998   Mean   :20597131   Mean   :1828   Mean   :88.76
 3rd Qu.:2003   3rd Qu.:20742853   3rd Qu.:1916   3rd Qu.:92.88
 Max.   :2007   Max.   :20993988   Max.   :1982   Max.   :96.18  

You can see what the variables/columns contain in the file _input/keepVing.r. Fit a nice regression to model the normalized frequency of the construction over time.

plot(pch=16, col="#00000020",
   x=jitter(d$YEAR),
   y=       d$TOKpmw); grid()
summary(m_straight <- lm(TOKpmw ~ 1 + YEAR, data=d))

Call:
lm(formula = TOKpmw ~ 1 + YEAR, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-10.153  -2.800   1.169   2.872   7.050

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1400.4890   427.3645  -3.277  0.00474 **
YEAR            0.7452     0.2138   3.485  0.00306 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.707 on 16 degrees of freedom
Multiple R-squared:  0.4315,    Adjusted R-squared:  0.3959
F-statistic: 12.14 on 1 and 16 DF,  p-value: 0.003061
abline(m_straight, col="red")

# show the diagnostic plots:
par(mfrow=c(1,2))
plot(m_straight, which=1:2)

par(mfrow=c(1,1))

# regression with breakpoints
possible_breakpoints <- setNames( # make possible_breakpoints
   # the middles between each value & the next
   (d$YEAR[-1] + d$YEAR[-length(d$YEAR)]) / 2,
   # & assign as names the lower and the upper values of each pair:
   paste(d$YEAR[1:17], d$YEAR[2:18], sep="-"))
possible_deviances <- setNames(
   rep(NA, length(possible_breakpoints)),
   possible_breakpoints)
for (i in seq(possible_breakpoints)) { # loop over the breakpoints
   d$YEAR_bin <- factor(ifelse( # create a binary predictor at breakpoint
      d$YEAR>possible_breakpoints[i], "gtbp", "ltbp")) # w/ these levels
   # collect the deviance when that predictor is used in the model:
   possible_deviances[i] <- lm(TOKpmw ~ 1 + YEAR*YEAR_bin, data=d) %>% deviance
}

sort(possible_deviances)
  2000.5   1999.5   1996.5   1991.5   1994.5   1993.5   1992.5   1995.5
157.5323 162.2327 167.0069 168.8084 169.0062 169.3934 172.0671 172.6698
  1997.5   1998.5   2001.5   2004.5   2002.5   1990.5   2003.5   2005.5
194.7476 199.6438 228.5535 300.7715 308.0384 329.0570 336.1889 347.5116
  2006.5
353.3920 
breakpoint <- 1996.5 # I am not picking the best one here -- why?
plot(type="l",
   xlim=c(1990,  2007), x=possible_breakpoints,
   ylim=c(0, 350), y=possible_deviances); grid()
arrows(breakpoint, 250, breakpoint, possible_deviances[as.character(breakpoint)])
text(breakpoint, 250, breakpoint, pos=3)

d$YEAR_bin <- factor(
   ifelse(d$YEAR>breakpoint, "gtbp", "ltbp"), levels=c("ltbp", "gtbp"))
summary(m_breakp <- lm( # fit and
   TOKpmw ~ 1 +         # summarize the
   YEAR + YEAR_bin + YEAR:YEAR_bin, # model w/ the
   data=d))             # new predictors

Call:
lm(formula = TOKpmw ~ 1 + YEAR + YEAR_bin + YEAR:YEAR_bin, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5289 -2.5491  0.6309  2.3026  5.0605

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)       -5895.2245  1300.8636  -4.532 0.000470 ***
YEAR                  3.0004     0.6527   4.597 0.000415 ***
YEAR_bingtbp       5774.2274  1458.3893   3.959 0.001425 **
YEAR:YEAR_bingtbp    -2.8943     0.7311  -3.959 0.001427 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.454 on 14 degrees of freedom
Multiple R-squared:  0.7322,    Adjusted R-squared:  0.6748
F-statistic: 12.76 on 3 and 14 DF,  p-value: 0.0002718
# very small loss of R-squared for the not quite optimal breakpoint
par(mfrow=c(1,2))         # show the
plot(m_breakp, which=1:2) # diagnostic

par(mfrow=c(1,1))         # plots
# still not perfect (but this is 18 data points ...)

anova(m_straight, m_breakp, test="F")
Analysis of Variance Table

Model 1: TOKpmw ~ 1 + YEAR
Model 2: TOKpmw ~ 1 + YEAR + YEAR_bin + YEAR:YEAR_bin
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     16 354.49
2     14 167.01  2    187.48 7.8581 0.005152 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m_straight); AIC(m_breakp)
[1] 110.7272
[1] 101.1797
# significantly better than the straight line, of course

d <- cbind(d,
   predict(m_breakp, newdata=d, interval="confidence"))
colnames(d)[6:8] <- c("fit.brk", "lwr.brk", "upr.brk")
plot1 <- function() {
   # observed data
   plot(pch=16, col=ifelse(d$YEAR_bin=="ltbp", "#FF00FFFF", "#00FF00FF"),
      x=d$YEAR, y=d$TOKpmw); grid(); abline(v=breakpoint, lty=3)
   abline(m_straight, lty=2) # the straight-line regression model
   # predicted values
   lines(x=d$YEAR[d$YEAR_bin=="ltbp"], y=d$fit.brk[d$YEAR_bin=="ltbp"], col="#FF00FFFF")
   with(d[d$YEAR_bin=="ltbp",],
      polygon(border=NA, col="#FF00FF30",
         c(YEAR, rev(YEAR)), c(lwr.brk, rev(upr.brk))))
   lines(x=d$YEAR[d$YEAR_bin=="gtbp"], y=d$fit.brk[d$YEAR_bin=="gtbp"], col="#00FF00FF")
   with(d[d$YEAR_bin=="gtbp",],
      polygon(border=NA, col="#00FF0030",
         c(YEAR, rev(YEAR)), c(lwr.brk, rev(upr.brk))))
}; plot1()

# spline
library(splines)
summary(m_spline <- lm(
   # NB: ns()'s knots= argument expects knot *locations*, not a count, so
   # knots=2 places the single knot at YEAR=2, far outside the data range --
   # hence the singularity flagged in the output below (presumably df=2 or
   # knots=breakpoint was intended):
   TOKpmw ~ 1 + ns(YEAR, knots=2),
   data=d))

Call:
lm(formula = TOKpmw ~ 1 + ns(YEAR, knots = 2), data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-10.153  -2.800   1.169   2.872   7.050

Coefficients: (1 not defined because of singularities)
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            147.88      17.00   8.698 1.84e-07 ***
ns(YEAR, knots = 2)1  -118.61      34.04  -3.485  0.00306 **
ns(YEAR, knots = 2)2       NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.707 on 16 degrees of freedom
Multiple R-squared:  0.4315,    Adjusted R-squared:  0.3959
F-statistic: 12.14 on 1 and 16 DF,  p-value: 0.003061
AIC(m_breakp); AIC(m_spline) # this spline is much worse -- in fact its AIC equals m_straight's: w/ its only knot outside the data range, it collapses to the straight-line fit
[1] 101.1797
[1] 110.7272
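For comparison, here is a sketch of what `ns()` looks like when its knot is placed inside the data range, at the breakpoint; toy data shaped like `d` (an assumption, since the real fit above is left untouched) are used, and no singularity arises:

```r
# sketch: a natural spline whose knot lies inside the data range (at the
# breakpoint), fit on toy data shaped like d; no singularity arises here:
library(splines)
set.seed(2)
toy <- data.frame(YEAR=1990:2007)
toy$TOKpmw <- ifelse(toy$YEAR <= 1996.5, 3 * (toy$YEAR - 1990), 19.5) +
   rnorm(18, sd=1.5)
m_ns <- lm(TOKpmw ~ 1 + ns(YEAR, knots=1996.5), data=toy)
coef(m_ns) # both spline coefficients are estimated (no NA)
```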
# gam
library(mgcv)
summary(m_gam <- gam(
   TOKpmw ~ 1 + s(YEAR), data=d))

Family: gaussian
Link function: identity

Formula:
TOKpmw ~ 1 + s(YEAR)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  88.7569     0.7562   117.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Ref.df     F  p-value
s(YEAR) 3.45  4.276 10.31 0.000374 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.719   Deviance explained = 77.6%
GCV = 13.672  Scale est. = 10.292    n = 18
AIC(m_breakp); AIC(m_gam) # gam is better, but might be too fine-grained!
[1] 101.1797
[1] 98.8352
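If the gam's flexibility is a worry ("too fine-grained"), the basis dimension of the smooth can be capped with `s()`'s `k` argument, which bounds the effective degrees of freedom at k-1. A sketch on toy data shaped like `d` (`k=5` is an arbitrary illustration, not a recommendation):

```r
# sketch: capping the wiggliness of the smooth via the basis dimension k;
# toy data shaped like d, k=5 chosen arbitrarily for illustration:
library(mgcv)
set.seed(3)
toy <- data.frame(YEAR=1990:2007)
toy$TOKpmw <- ifelse(toy$YEAR <= 1996.5, 3 * (toy$YEAR - 1990), 19.5) +
   rnorm(18, sd=1.5)
m_gam_k <- gam(TOKpmw ~ 1 + s(YEAR, k=5), data=toy) # at most k-1=4 edf
summary(m_gam_k)$edf # effective degrees of freedom of the capped smooth
```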
# note how I compute the confidence interval for the gam predictions:
qwe <- predict(m_gam, newdata=d, se.fit=TRUE) # w/ std. errors ...
d <- cbind(d, "fit.gam"=qwe$fit,              # ... from the
   "lwr.gam"=qwe$fit+qnorm(0.025)*qwe$se.fit, # normal
   "upr.gam"=qwe$fit+qnorm(0.975)*qwe$se.fit) # distribution
plot1(); lines(d$YEAR, d$fit.gam, col="blue", lwd=3)
polygon(border=NA, col="#0000FF10",
   c(d$YEAR, rev(d$YEAR)), c(d$lwr.gam, rev(d$upr.gam)))

Additional question: if we go with m_breakp here: What are the slopes of YEAR before and after the breakpoint, i.e. during the ascent of the construction and when it seems to level off? Does it really

  • ‘ascend’ (in the sense that the slope of YEAR before the breakpoint is indeed significantly greater than 0)?
  • ‘level off’ (in the sense that the slope of YEAR after the breakpoint is indeed not significantly different from 0 anymore)?

In other words, what are the two slopes of YEAR before and after the breakpoint, and are they significantly different from 0? (We know they are significantly different from each other from the model summary.) If this looks like a glht question to you, good for you!
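As a hint (a sketch on toy data, not the worked answer): with the coefficients ordered as (Intercept), YEAR, YEAR_bingtbp, YEAR:YEAR_bingtbp, each contrast row handed to `glht` sums up the coefficients that make up one slope:

```r
# sketch of the glht logic on toy data shaped like d: one contrast row per
# slope, picking out YEAR alone (ltbp) vs. YEAR + the interaction (gtbp):
library(multcomp)
set.seed(4)
toy <- data.frame(YEAR=1990:2007)
toy$YEAR_bin <- factor(ifelse(toy$YEAR > 1996.5, "gtbp", "ltbp"),
   levels=c("ltbp", "gtbp"))
toy$TOKpmw <- ifelse(toy$YEAR_bin=="ltbp", 3 * (toy$YEAR - 1990), 19.5) +
   rnorm(18)
m_toy <- lm(TOKpmw ~ 1 + YEAR * YEAR_bin, data=toy)
summary(glht(m_toy, linfct=rbind(
   "slope of YEAR for ltbp"=c(0, 1, 0, 0),    # the YEAR coefficient alone
   "slope of YEAR for gtbp"=c(0, 1, 0, 1))))  # YEAR + the interaction
```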

2 Session info

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  compiler
[8] methods   base

other attached packages:
 [1] emmeans_2.0.1   mgcv_1.9-4      nlme_3.1-168    MuMIn_1.48.11
 [5] multcomp_1.4-29 TH.data_1.1-5   MASS_7.3-65     survival_3.8-6
 [9] mvtnorm_1.3-3   effects_4.2-4   car_3.1-3       carData_3.0-5
[13] STGmisc_1.06    Rcpp_1.1.1      magrittr_2.0.4

loaded via a namespace (and not attached):
 [1] sandwich_3.1-1     lattice_0.22-7     lme4_1.1-38        digest_0.6.39
 [5] evaluate_1.0.5     grid_4.5.2         estimability_1.5.1 fastmap_1.2.0
 [9] jsonlite_2.0.0     Matrix_1.7-4       nnet_7.3-20        DBI_1.2.3
[13] Formula_1.2-5      codetools_0.2-20   abind_1.4-8        reformulas_0.4.3.1
[17] Rdpack_2.6.5       cli_3.6.5          mitools_2.4        rlang_1.1.7
[21] rbibutils_2.4.1    yaml_2.3.12        otel_0.2.0         tools_4.5.2
[25] coda_0.19-4.1      nloptr_2.2.1       minqa_1.2.8        colorspace_2.1-2
[29] boot_1.3-32        stats4_4.5.2       zoo_1.8-15         htmlwidgets_1.6.4
[33] insight_1.4.5      xfun_0.56          rstudioapi_0.18.0  knitr_1.51
[37] xtable_1.8-4       htmltools_0.5.9    survey_4.4-8       rmarkdown_2.30