Predictive modeling @ JLU Giessen: tree-based approaches

Author

Affiliations

Stefan Th. Gries

UC Santa Barbara

JLU Giessen

Published

03 Aug 2025 12-34-56

1 Trees

1.1 Classification and regression trees

We’ll start our discussion of trees by revisiting an example that we discussed before in the context of glms: the choice of a genitive construction with the data in _input/genitives.csv and information about the variables/columns in _input/genitives.r. As before, our response variable is GENITIVE:

rm(list=ls(all=TRUE)); invisible(gc())
set.seed(sum(utf8ToInt("Gummibärchen")))
library(pdp); library(tree)
invisible(sapply(dir("_helpers", full.names=TRUE), source))
summary(d <- read.delim(        # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE))      # change categorical variables into factors

      CASE      GENITIVE  SPEAKER       MODALITY      POR_LENGTH
 Min.   :   2   of:2720   nns:2666   spoken :1685   Min.   :  1.00
 1st Qu.:1006   s : 880   ns : 934   written:1915   1st Qu.:  8.00
 Median :2018                                       Median : 11.00
 Mean   :2012                                       Mean   : 14.58
 3rd Qu.:3017                                       3rd Qu.: 17.00
 Max.   :4040                                       Max.   :204.00
   PUM_LENGTH         POR_ANIMACY   POR_FINAL_SIB        POR_DEF
 Min.   :  2.00   animate   : 920   absent :2721   definite  :2349
 1st Qu.:  6.00   collective: 607   present: 879   indefinite:1251
 Median :  9.00   inanimate :1671
 Mean   : 10.35   locative  : 243
 3rd Qu.: 13.00   temporal  : 159
 Max.   :109.00

d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH)-log2(d$PUM_LENGTH)
d <- d[,c(1:4,10,6:9,5)]

Let’s re-compute the baseline for and deviance of our response variable; you should revisit the first fixef part to see why I could compute the deviance here as I do:

(baseline <- max(          # make baseline the highest
   prop.table(             # proportion in the
      table(d$GENITIVE)))) # frequency table of the response
d$GENITIVE %>% table %>% prop.table %>%
   rep(table(d$GENITIVE)) %>% log %>%
   sum %>% "*"(-2) # deviance
# but you can also fit a 'null tree':
(cart_0 <- tree(GENITIVE ~ 1, data=d)) %>% deviance

[1] 0.7555556
[1] 4004.273
[1] 4004.273

Let’s begin with an extremely simple, monofactorial scenario, in which we’ll look at the question of whether the genitive choices (GENITIVE) are correlated with animacy of the possessor (POS_ANIMACY). We first do a simple descriptive cross-tabulation and then use the function tree::tree (Ripley 2019) to create a first classification tree, which we also immediately summarize:

addmargins(table(d$POR_ANIMACY, d$GENITIVE))


               of    s  Sum
  animate     370  550  920
  collective  408  199  607
  inanimate  1638   33 1671
  locative    199   44  243
  temporal    105   54  159
  Sum        2720  880 3600

summary(cart_1 <- tree( # summarize a tree called cart_1
   GENITIVE ~   # a classification tree of GENITIVE ~
   POR_ANIMACY, # POR_ANIMACY
   data=d))     # those variables are in d


Classification tree:
tree(formula = GENITIVE ~ POR_ANIMACY, data = d)
Number of terminal nodes:  3
Residual mean deviance:  0.7749 = 2787 / 3597
Misclassification error rate: 0.1944 = 700 / 3600

Let’s start at the bottom, where the results indicate that the tree makes 700 incorrect classifications and, thus, 3600-700=2900 correct ones, i.e. it has an accuracy of 0.8056. Next, the tree that the algorithm came up with has three terminal nodes – what does it look like?

plot(cart_1);grid() # plot the classification tree
axis(2); mtext("Deviance", 2, 3) # add a useful y-axis
text(cart_1,   # add labels to the tree
     pretty=4, # abbreviate the labels to 4 characters
     all=TRUE) # label all nodes & splits

Figure 1: The classification tree of cart_1

The way to read this is quite simple: we start at the top, the so-called root (node) and see a first split and then right below it you see of. That of means ‘if you don’t do this split, you should predict of (because it’s the more frequent level of GENITIVE)’. The expression above the split should be read as a question or a conditional expression: Is POR_ANIMACY inanimate? If ‘yes’, you go to the left and down that branch to the first terminal node, which ‘predicts’ of; if ‘no’, you go to the right and down that branch, which now features no inanimate possessors anymore, to the second split, which asks ‘is POR_ANIMACY collective, locative, or temporal?. If yes, you got to the left and again ’predict’ of, if no, you are left with only animate possessors and ‘predict’ s.

This information is also represented in the object cart_1 in tabular/text form:

cart_1

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 3600 4004.0 of ( 0.75556 0.24444 )
  2) POR_ANIMACY: inanimate 1671  324.4 of ( 0.98025 0.01975 ) *
  3) POR_ANIMACY: animate,collective,locative,temporal 1929 2645.0 of ( 0.56091 0.43909 )
    6) POR_ANIMACY: collective,locative,temporal 1009 1223.0 of ( 0.70565 0.29435 ) *
    7) POR_ANIMACY: animate 920 1240.0 s ( 0.40217 0.59783 ) *

cart_1$frame # doesn't get displayed in render, must be a bug

          var    n       dev yval splits.cutleft splits.cutright   yprob.of
1 POR_ANIMACY 3600 4004.2730   of             :c           :abde 0.75555556
2      <leaf> 1671  324.3722   of                                0.98025135
3 POR_ANIMACY 1929 2645.4618   of           :bde              :a 0.56091239
6      <leaf> 1009 1222.9111   of                                0.70564916
7      <leaf>  920 1239.9452    s                                0.40217391
     yprob.s
1 0.24444444
2 0.01974865
3 0.43908761
6 0.29435084
7 0.59782609

Let’s now extend this now to something more realistic (by using all predictors) and also look at tree’s output a bit more comprehensively:

summary(cart_2 <- tree( # summarize an object called cart_2
   GENITIVE ~           # GENITIVE ~ all predictors
   LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d))            # those variables are in d


Classification tree:
tree(formula = GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
    MODALITY + POR_FINAL_SIB + POR_DEF, data = d)
Variables actually used in tree construction:
[1] "POR_ANIMACY"     "LEN_PORmPUM_LOG"
Number of terminal nodes:  5
Residual mean deviance:  0.7066 = 2540 / 3595
Misclassification error rate: 0.1625 = 585 / 3600

We get the same info as before, but now also a first (coarse) indication of variable importance, namely a list of the variables that actually feature in the tree. From the absence of several variables in that list, we can infer most variables are actually not relevant to the tree. Let’s look at this in more detail:

cart_2

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 3600 4004.0 of ( 0.75556 0.24444 )
   2) POR_ANIMACY: inanimate 1671  324.4 of ( 0.98025 0.01975 ) *
   3) POR_ANIMACY: animate,collective,locative,temporal 1929 2645.0 of ( 0.56091 0.43909 )
     6) POR_ANIMACY: collective,locative,temporal 1009 1223.0 of ( 0.70565 0.29435 )
      12) LEN_PORmPUM_LOG < -0.522544 313  427.4 s ( 0.42812 0.57188 ) *
      13) LEN_PORmPUM_LOG > -0.522544 696  633.6 of ( 0.83046 0.16954 ) *
     7) POR_ANIMACY: animate 920 1240.0 s ( 0.40217 0.59783 )
      14) LEN_PORmPUM_LOG < 0.821928 618  752.6 s ( 0.29773 0.70227 ) *
      15) LEN_PORmPUM_LOG > 0.821928 302  402.3 of ( 0.61589 0.38411 ) *

cart_2$frame # doesn't get displayed in render, must be a bug

               var    n       dev yval splits.cutleft splits.cutright
1      POR_ANIMACY 3600 4004.2730   of             :c           :abde
2           <leaf> 1671  324.3722   of
3      POR_ANIMACY 1929 2645.4618   of           :bde              :a
6  LEN_PORmPUM_LOG 1009 1222.9111   of     <-0.522544      >-0.522544
12          <leaf>  313  427.4180    s
13          <leaf>  696  633.5778   of
7  LEN_PORmPUM_LOG  920 1239.9452    s      <0.821928       >0.821928
14          <leaf>  618  752.6407    s
15          <leaf>  302  402.2872   of
     yprob.of    yprob.s
1  0.75555556 0.24444444
2  0.98025135 0.01974865
3  0.56091239 0.43908761
6  0.70564916 0.29435084
12 0.42811502 0.57188498
13 0.83045977 0.16954023
7  0.40217391 0.59782609
14 0.29773463 0.70226537
15 0.61589404 0.38410596

Ok, so now looking at a plot might be more helpful … maybe:

plot(cart_2); grid() # plot the classification tree
axis(2); mtext("Deviance", 2, 3) # add a useful y-axis
text(cart_2,  # add labels to the tree
     pretty=4, # abbreviate the labels to 4 characters
     all=TRUE) # label all nodes & splits

Figure 2: The classification tree of cart_2

How do we interpret this?

of-genitives are predicted when
- node 2: possessors are inanimate (p_correct=0.98);
- node 13: possessors are {collective, locative, temporal} and LEN_PORmPUM_LOG is > -0.522544 (i.e. the possessor is mostly longer than the possessum and usually not at 3 units longer than the possessum) (p_correct=0.83);
- node 15: possessors are animate and LEN_PORmPUM_LOG is > 0.821928 (i.e. the possessor is at least 3 units longer than the possessum) (p_correct=0.616);
s-genitives are predicted when
- node 12: possessors are {collective, locative, temporal} and LEN_PORmPUM_LOG is < -0.522544 (i.e. the possessor is at least 2 units shorter than the possessum) (p_correct=0.572);
- node 14: possessors are animate and LEN_PORmPUM_LOG < 0.821928 (i.e. the possessor is mostly shorter than the possessum and usually not more than 4 units longer) (p_correct=0.70227).

So, let’s compute predictions and look at the tree’s performance in a bit more detail than just the accuracy. TBH, I hardly ever see the C-score reported for trees and forests, but of course that’s an option; similarly, I never see Cohen’s kappa or an R-squared reported for trees or forests – I don’t understand why no one is doing that, given that esp. McFadden’s R² is very easy to compute and makes a lot of conceptual sense (both in general and because, obviously, deviance can play a big role for trees):

d$PRED_PP_s <- predict(   # make d$PRED_PP_s the result of predicting
   cart_2                 # from cart_2
   )[,"s"]                # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_s>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1]), # otherwise, predict "of"
   levels=levels(d$GENITIVE)) # make sure both levels are registered
(c_m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns

    PREDS
OBS    of    s
  of 2402  318
  s   267  613

c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))

     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
       0.6584318        0.6965909        0.8999625        0.8830882
  Acc. (overall)
       0.8375000

c("McFadden's R-squared"=(deviance(cart_0)-deviance(cart_2)) / deviance(cart_0),
  "Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$GENITIVE, d$PRED_PP_s))

McFadden's R-squared        Cohen's kappa                    C
           0.3656037            0.5685346            0.8793282

Pretty good actually!

detach(package:tree) # housekeeping

1.2 Conditional inference trees

An alternative tree-based approach that has gained some popularity in linguistic applications is that of conditional inference trees, in particular as implemented in the packages party (Hothorn et al. 2006) and partykit (Hothorn et al. 2015), which address concerns regarding the degree to which classification trees of the above kind can (i) overfit (which can be countered by pruning as alluded to above) and (ii) overestimate the importance of predictors with many different values (i.e. numeric predictors or categorical ones with many levels, see Strobl et al. 2009:342). This approach is conceptually similar to the kind of trees discussed above, but uses a p-value (adjusted for multiple tests) as a splitting criterion. Creating a conditional inference tree in R is (deceptively) simple, given everything we’ve discussed so far:

library(partykit) # the newer reimplementation of the package party
ctree_1 <- ctree(  # create an object called ctree_1
   GENITIVE ~      # GENITIVE ~ all predictors
   LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d)         # those variables are in d
plot(ctree_1)

Figure 3: The conditional inference tree of ctree_1

Actually, let’s tweak the plot a bit to make it more legible:

levels(ctree_1$data$POR_ANIMACY)   <- abbreviate(levels(ctree_1$data$POR_ANIMACY), 3)
levels(ctree_1$data$MODALITY)      <- c("spok", "writ")
levels(ctree_1$data$POR_DEF)       <- c("def", "indef")
levels(ctree_1$data$POR_FINAL_SIB) <- c("abs", "pres")
plot(ctree_1)

Figure 4: The conditional inference tree of ctree_1

Ever so slightly more complex … If you look at this tree, though, why might its additional complexity not really be that great?

s-genitives are really rare on the right, where inanimates and locatives are – s-genitives are only ever preferred with locatives and possessors that are notably shorter than possessums;
s-genitives are mostly preferred on the left and mostly with animate possessors unless the length difference prefers of-genitives more;
does this even have a significantly higher degree of accuracy or Cohen’s κ?

Let’s now evaluate this more formally:

d$PRED_PP_s <- predict(   # make d$PRED_PP_s the result of predicting
   ctree_1,               # from ctree_1
   type="prob")[,"s"]     # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_s>0.5,    # if the predicted prob. of "s" is >0.5 (note: >)
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1]), # otherwise, predict "of"
   levels=levels(d$GENITIVE)) # make sure both levels are registered
# you also just use predict(ctree_1)
(c_m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns

    PREDS
OBS    of    s
  of 2432  288
  s   257  623

c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))

     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
       0.6838639        0.7079545        0.9044254        0.8941176
  Acc. (overall)
       0.8486111

c("McFadden's R-squared"=(deviance(cart_0)-2334.671) / deviance(cart_0),
  "Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$GENITIVE, d$PRED_PP_s))

McFadden's R-squared        Cohen's kappa                    C
           0.4169551            0.5949829            0.9042862

Is the accuracy of this tree (0.8486111) significantly better than that of the much simpler tree earlier (0.8375)? No (and the confidence interval of each Cohen’s κ includes the other one):

binom.test(3055, 3600, 3015/3600)$p.value # not really

[1] 0.0707569

However, one major problem of both kinds of trees is “their instability to small changes in the learning data […] the entire tree structure could be altered if the first splitting variable [something to which we will return in detail below], or only the first cutpoint, was chosen differently because of a small change in the learning data. Because of this instability, the predictions of single trees show a high variability” (Strobl et al. 2009:330). Another problem of trees is that “the prediction of single trees is piecewise constant and thus may jump from one value to the next even for small changes of the predictor values” (Strobl et al. 2009:331). Both of these problems can be dealt with to some degree using so-called ensemble methods such as random forests, to which we turn now.

detach(package:partykit) # housekeeping

2 Ensembles of trees: random forests

Random forests (Breiman 2001a, Hastie, Tibshirani, & Friedman 2014: Ch. 15, James et al. 2021: Section 8.2) are an instance of what are called ensemble methods because “an ensemble or committee of classification trees is aggregated for prediction” (Strobl et al. 2009:325). In other words, a random forest consists of ntree different trees, but with two additional layers of randomness, namely randomness

on the level of the data points because random forests involve running a tree on ntree different data sets that were bootstrapped – sampled with replacement – from the original data (see SFLWR³, Section 5.2.1.5) and then amalgamating all those ntree trees;
on the level of predictors because at every split in every tree, only a randomly-selected subset of the predictors is eligible to be chosen, namely mtry predictors.

Random forests usually overfit much less than individual trees or even training/test validation approaches and can be straightforwardly evaluated given that random forest accuracies are prediction, not classification, accuracies. Why/how is that? If you bootstrap with replacement from your data, then, like I mentioned above, each of the ntree trees will

be based only on the cases that were sampled (with replacement) this time around, i.e. their training set, which, on average, will be 63.2% of all your different cases);
not ‘have seen’ the, on average, remaining 100-63.2=36.8% of the cases that were not sampled for the current tree, i.e. cases that the current tree was not trained on, which are referred to as the held-out cases or as the OOB (out of bag) cases.

Random forests are often excellent at predictions. But there is also at least one disadvantage of random forests: While they excel at prediction (often outperforming regression models and individual trees in that sense), interpreting them can be hard: There is no single model, no single tree from a single data set that can be straightforwardly plotted – there’s a forest of, say, 500 different trees fit on 500 differently-sampled data sets, with thousands of different splits in different locations in trees with differently-sampled predictors available for splits. And because of precisely that I hope to convince you that the practice of trying to appealingly interpret/summarize such a forest of ntree trees with a single tree fitted to the whole data set is more than problematic; see James et al. (2021:343) for a comment regarding trees/forests to that effect:

“Recall that one of the advantages of decision trees is the attractive and easily interpreted diagram that results, such as the one displayed in Figure 8.1. However, when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure using a single tree […]”

With regard to the size or importance of an effect/predictor, the random forest implementations that seem to be most widely used in linguistics – randomForest::randomForest and party::cforest/partykit::cforest – offer the functionality of computing so-called variable importance scores, some version of which is regularly reported. Variable importance scores basically answer the question “how much (if at all) does a certain predictor contribute to predicting the response?”.

Warning: Note the exact formulation I’m using here: There’s an important reason why I did not simply say “how much does a certain predictor predict the response?” or “how much is a predictor correlated with the response?”.

With regard to the nature/direction of an effect, on the other hand, one can report partial dependence scores, which answer the question “how do values/levels of a certain predictor help predict the response?”, but for no good reason I see those much less often. Let’s now discuss how this all works; thankfully, in terms of R code, it’s all pretty (but again deceptively) straightforward.

2.1 Forests of classification and regression trees

Let’s revisit the genitive data using the maybe best-known implementation in randomForest::randomForest (Liaw & Wiener 2002). The following shows that we use the familiar formula notation, but we also tweak a few hyperparameters and arguments (the book discusses more):

ntree: how many trees we want the forest to have; the default is 500 but since the data set is not that big, we can straightforwardly increase that number because “[i]n general, the higher the number of trees, the more reliable the prediction and the interpretability of the variable importance” (Strobl et al. 2009:343);
mtry: how many predictors we want to be available at each split; the default for a forest of
- classification trees is the square root of the number of predictors in the formula
- regression trees is ¹/₃ of the number of predictors;
- note that “[w]hen predictor variables are highly correlated, […] a higher number of randomly preselected predictor variables is better suited to reflect conditional importance” (Strobl et al. 2009:343);
importance: whether we want predictor importance to be computed (the default is FALSE).

library(randomForest)
# set a replicable random number seed
set.seed(sum(utf8ToInt("Bacon bits"))); (rf_1 <- randomForest(GENITIVE ~
   LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d,           # the data are in d
   ntree=1500,       # how many trees to fit/how many data sets to ...
   mtry=3,           # how many variables are eligible at each split
   importance=TRUE)) # compute importance scores


Call:
 randomForest(formula = GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY +      SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF, data = d, ntree = 1500,      mtry = 3, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 1500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 14.28%
Confusion matrix:
     of   s class.error
of 2452 268  0.09852941
s   246 634  0.27954545

The output shows that our forest obtained a prediction accuracy of 1-0.1428=0.8572 and that the classification/prediction error was much higher for s (0.28) than for of (0.099), but of course we want all individual predictions and the evaluation of our usual confusion matrix:

d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting
   rf_1,                # from rf_1
   type="prob")[,"s"]   # predicted probabilities of "s"
d$PRED_CAT <- predict(rf_1) # categorical predictions
d$PRED_PP_obs <- ifelse( # d$PRED_PP_obs is determined by ifelse
   d$GENITIVE=="s",      # if the obs genitive is the 2nd level of the response
     d$PRED_PP_s,        # take its predicted probability
   1-d$PRED_PP_s)        # otherwise take 1 minus its predicted probability
d$CONTRIBS2LL <- -log(d$PRED_PP_obs)
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,  # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns

    PREDS
OBS    of    s
  of 2452  268
  s   246  634

c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))

     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
       0.7028825        0.7204545        0.9088213        0.9014706
  Acc. (overall)
       0.8572222

c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$GENITIVE, d$PRED_PP_s))

Cohen's kappa             C
    0.6167103     0.8995413

Better than the trees but again not by that much. But what about R-squared? For that, we need to compute the deviance of the forest manually and the plug that into our usual equation for McFadden’s R²:

"-"(
   deviance(cart_0),
   sum(-log(d$PRED_PP_obs)) * 2) / deviance(cart_0)

[1] -Inf

?! Why doesn’t this work?

summary(d$PRED_PP_obs)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.7665  0.9853  0.8258  1.0000  1.0000

How do we address this? By smoothing or clipping; here’s a very didactically detailed way to do this:

offset <- min(
   d$PRED_PP_obs %>% unique %>% sort       %>% "["(2),
   d$PRED_PP_obs %>% unique %>% sort(TRUE) %>% "["(2) %>% "-"(1, .)) / 2
d$PRED_PP_obs[d$PRED_PP_obs==0] <- offset
d$PRED_PP_obs[d$PRED_PP_obs==1] <- 1-offset
summary(d$PRED_PP_obs)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0008251 0.7665302 0.9853076 0.8255153 0.9991749 0.9991749

Now it works:

"-"(
   deviance(cart_0),
   sum(-log(d$PRED_PP_obs)) * 2) / deviance(cart_0)

[1] 0.3107877

But which variables are driving this prediction accuracy most? We can get those from rf_1 with the function importance:

varimps <- rf_1$importance[,3:4] # make varimps the variable importances of rf_1
varimps[order(varimps[,1]),]

                MeanDecreaseAccuracy MeanDecreaseGini
MODALITY                 0.004675122         25.75982
SPEAKER                  0.005173870         26.56607
POR_FINAL_SIB            0.009581128         28.59611
POR_DEF                  0.013232235         37.75380
LEN_PORmPUM_LOG          0.049796542        340.39492
POR_ANIMACY              0.148589681        420.16544

Apparently, POR_ANIMACY is by far the most important variable, followed by LEN_PORmPUM_LOG, which nicely replicates the result from tree::tree.

I’ve seen many papers/talks that leave it at this or do something really (sorry …) self-contradictory, namely discuss the nature/direction of the effects in a monofactorial (e.g., chi-squared based) way, but we’ll do better and look at partial dependence scores, which we’ll compute with pdp::partial; when we do so, I always set the which.class argument to the second level of the response because I’m so used to that from regression modeling (just like I did above for computing d$PRED_PP_s). Let’s look at scores for one categorical predictor (POR_ANIMACY) and one numeric predictor (LEN_PORmPUM_LOG):

(pd_pa <- partial(         # make pd_pa contain partial dependence scores
   object=rf_1,            # from this forest
   pred.var="POR_ANIMACY", # for this predictor
   which.class=2,          # for the 2nd level of the response
   prob=TRUE,              # return results on the probability scale
   train=d))               # these were the training data

  POR_ANIMACY       yhat
1     animate 0.65402944
2  collective 0.22310852
3   inanimate 0.02442722
4    locative 0.13524333
5    temporal 0.27615389

pd_ld <- partial(              # make pd_ld contain partial dependence scores
   object=rf_1,                # from this forest
   pred.var="LEN_PORmPUM_LOG", # for this predictor
   which.class=2,              # for the 2nd level of the response
   prob=TRUE,                  # return results on the probability scale
   train=d)                    # these were the training data
pd_ld[sort(sample(nrow(pd_ld), 5)),] # here's a sample of 5 rows

   LEN_PORmPUM_LOG       yhat
24      -0.2741705 0.25516019
29       0.6496102 0.24428407
37       2.1276594 0.08156870
41       2.8666840 0.02448519
48       4.1599771 0.02430722

And then we could plot those:

par(mfrow=c(1, 2))
plot(pd_pa); grid()
plot(pd_ld, type="l"); grid()
par(mfrow=c(1, 1)) # reset to default

Figure 5: Partial dependence plots for rf_1 (basic)

This is very very similar to the effects plots we used for the regression results from before, and it’s interpretationally also very similar to the tree results. In my own work, I’d probably plot something like this, though – with bar widths reflecting the frequencies of the levels of POR_ANIMACY and point sizes reflecting the frequencies of the differences clause lengths:

Figure 6: Partial dependence plots for rf_1 (nicer)

No surprises there …

detach(package:randomForest) # housekeeping

2.2 Forests of conditional inference trees

Building a forests of conditional inference trees is only a bit different from the use of randomForest above:¹

library(partykit)
# set a replicable random number seed
set.seed(sum(utf8ToInt("Gummibären"))); cf_1 <- cforest(GENITIVE ~
   LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d,     # the data are in d
   ntree=1500, # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE), # ... sample w/ replacement
   mtry=3)     # how many variables are eligible at each split
saveRDS(cf_1, file="_output/genitives_cf1.RDS")

As before, we can compute predictions and all the usual statistics:

d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting
   cf_1,                # from cf_1
   type="prob")[,"s"]   # predicted probabilities of "s"
d$PRED_CAT <- predict(cf_1) # categorical predictions
# below I will show you a faster way to do this
d$PRED_PP_obs <- ifelse( # d$PRED_PP_obs is determined by ifelse
   d$GENITIVE=="s",      # if the obs genitive is the 2nd level of the response
     d$PRED_PP_s,        # take its predicted probability
   1-d$PRED_PP_s)        # otherwise take 1 minus its predicted probability
d$CONTRIBS2LL <- -log(d$PRED_PP_obs)
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,  # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns

    PREDS
OBS    of    s
  of 2484  236
  s   214  666

c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))

     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
       0.7383592        0.7568182        0.9206820        0.9132353
  Acc. (overall)
       0.8750000

c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$GENITIVE, d$PRED_PP_s))

Cohen's kappa             C
    0.6644351     0.9407294

Similarly, we can compute variable importance scores with varimp, which computes importance measures whose logic is comparable to the permutation-accuracy one used in randomForest::importance above, where predictors’ associations with the response gets destroyed to measure the impact this has on prediction accuracy. However, with an argument called conditional you can choose how the permutation is done:

conditional=FALSE refers to the default of randomly permuting the column of interest;
conditional=TRUE means that the variable of interest is permuted in a way that takes its correlation with other predictors into consideration and can avoid exaggerating the importance of predictors that are correlated with other predictors (which is also why computing conditional scores takes considerably longer than computing unconditional ones or might not even be computable at all):

sort(varimp(cf_1)) # sort the variable importance scores of cf_1

       MODALITY         SPEAKER         POR_DEF   POR_FINAL_SIB LEN_PORmPUM_LOG
     0.02523194      0.03154644      0.04856007      0.05547140      0.28785335
    POR_ANIMACY
     1.08693322

# sort(varimp(cf_1,    # sort the variable importance scores of cf_1, but
#    conditional=TRUE, # the conditional version
#    cores=10))        # how many threads to use, adjust as needed (set to 1 on Windoze)

Let’s use the pdp package already introduced above to again compute partial dependence scores for just two predictors:

(pd_pa <- partial(         # make pd_pa contain partial dependence scores
   object=cf_1,            # from this forest
   pred.var="POR_ANIMACY", # for this predictor
   which.class=2,          # for the 2nd level of the response
   prob=TRUE,              # return results on the probability scale
   train=d))               # these were the training data

  POR_ANIMACY       yhat
1     animate 0.56373635
2  collective 0.26697231
3   inanimate 0.03756622
4    locative 0.15320667
5    temporal 0.29937002

pd_ld <- partial(              # make pd_ld contain partial dependence scores
   object=cf_1,                # from this forest
   pred.var="LEN_PORmPUM_LOG", # for this predictor
   which.class=2,              # for the 2nd level of the response
   prob=TRUE,                  # return results on the probability scale
   train=d)                    # these were the training data
pd_ld[seq(1, 51, 10),] # here's a sample of 5 rows

   LEN_PORmPUM_LOG       yhat
1        -4.523562 0.47971663
11       -2.676000 0.47877539
21       -0.828439 0.28549656
31        1.019123 0.16973386
41        2.866684 0.04317391
51        4.714246 0.04317425

And then we could plot those with either the standard plots …

par(mfrow=c(1, 2)) # make the plotting window have 1 rows & 2 columns
plot(pd_pa); grid()
plot(pd_ld, type="l"); grid()
par(mfrow=c(1, 1)) # reset to default

Figure 7: Partial dependence plots for cf_1 (basic)

… or our nicer ones:

Figure 8: Partial dependence plots for cf_1 (nicer)

Here, too, the overall picture is similar to what we found before – just make sure you write it up well: We need variable importance scores (to know what’s doing how much in our data) and partial dependence scores (to know what is doing what in our data). Let’s now look at another non-linguistic application.

3 Women and children first?

We will use a non-linguistic example here simply because it’s such a great data set and question: When the Titanic sank in 1912, does the pattern of who survived and who didn’t suggest that people were following the women-and-children-first principle? The Titanic data are available as a data frame of the usual kind, which contains this information.

Load the data and use a tree-based approach to study the above question.

rm(list=ls(all.names=TRUE)); invisible(gc())
summary(d <- read.delim("_input/titanic.csv", stringsAsFactors=TRUE))

      CASE       CLASS         SEX          AGE       SURVIVED
 Min.   :   1   1st :325   female: 470   adult:2092   no :1490
 1st Qu.: 551   2nd :285   male  :1731   child: 109   yes: 711
 Median :1101   3rd :706
 Mean   :1101   crew:885
 3rd Qu.:1651
 Max.   :2201

invisible(sapply(dir("_helpers", full.names=TRUE), source))

3.1 Exploration and preparation

Let’s do a very quick exploration just to get a feel for what’s going on:

ftable(d$CLASS, d$SEX, d$AGE  , d$SURVIVED) # perspective 1

                    no yes

1st  female adult    4 140
            child    0   1
     male   adult  118  57
            child    0   5
2nd  female adult   13  80
            child    0  13
     male   adult  154  14
            child    0  11
3rd  female adult   89  76
            child   17  14
     male   adult  387  75
            child   35  13
crew female adult    3  20
            child    0   0
     male   adult  670 192
            child    0   0

ftable(d$SEX  , d$AGE, d$CLASS, d$SURVIVED) # perspective 2

                    no yes

female adult 1st     4 140
             2nd    13  80
             3rd    89  76
             crew    3  20
       child 1st     0   1
             2nd     0  13
             3rd    17  14
             crew    0   0
male   adult 1st   118  57
             2nd   154  14
             3rd   387  75
             crew  670 192
       child 1st     0   5
             2nd     0  11
             3rd    35  13
             crew    0   0

Ok, so this data set could become tricky for regression modeling, depending on how one proceeds: There’s a structural zero in the data (no children in the crew) and the number of children in general is very small, which could lead to regression trouble.

Let’s also compute the baseline that our analysis will try to beat and the response’s deviance:

(baseline <- max(          # make baseline.1 the highest
   prop.table(             # proportion in the
      table(d$SURVIVED)))) # frequency table of the response variable

[1] 0.676965

deviance(glm(SURVIVED ~ 1, family=binomial, data=d))

[1] 2769.457

3.2 Fitting a conditional inference forest

Ok, let’s fit the forest:

set.seed(sum(utf8ToInt("Kackvogel"))); cf_1 <- cforest(SURVIVED ~ # a cond. inf. forest
   CLASS + SEX + AGE,          # w/ these predictors
   data=d,                     # the variables are in x
   ntree=750,                  # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE), # ... sample w/ replacement
   mtry=2)                     # how many variables are eligible at each split

3.3 Evaluating the forest’s performance

How well does that forest do in terms of prediction accuracy, precision, recall, etc.?

d$PRED_PP_y <- predict( # make d$PRED_PP_y the result of predicting
   cf_1,                # from cf_1
   type="prob")[,"yes"] # predicted probabilities of "yes"
d$PRED_CAT <- predict(cf_1) # categorical predictions
(c_m <- table(OBS=d$SURVIVED, # cross-tabulate observed survival outcomes in the rows
   PREDS=d$PRED_CAT))         # predicted survival outcomes in the columns

     PREDS
OBS     no  yes
  no  1470   20
  yes  441  270

d$PRED_PP_obs <- ifelse( # d$PREDS.NUM.t.obs is determined by ifelse
   d$SURVIVED=="yes",    # if the obs outcome is the 2nd level of the response
     d$PRED_PP_y,        # take its predicted probability
   1-d$PRED_PP_y)        # otherwise take 1 minus its predicted probability
c( # evaluate the confusion matrix
   "Pred. acc."         =mean(d$SURVIVED==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[     ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",]),
   "Prec. for no" =c_m[ "no","no"]  / sum(c_m[     , "no"]),
   "Rec. for no"  =c_m[ "no","no"]  / sum(c_m[ "no",]))

   Pred. acc. Prec. for yes  Rec. for yes  Prec. for no   Rec. for no
    0.7905498     0.9310345     0.3797468     0.7692308     0.9865772

c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$SURVIVED, d$PRED_PP_y))

Cohen's kappa             C
    0.4334102     0.7681090

3.4 Variable importance(s)

Which variables are most important for making these predictions?

sort(varimp(cf_1))   # sort the variable importance scores of cf_1

       AGE      CLASS        SEX
0.08947584 0.15257832 0.30121848

sort(varimp(cf_1,    # sort the variable importance scores of cf_1, but
   conditional=TRUE, # the conditional version
   cores=10))        # how many threads to use, adjust as needed (set to 1 on Windoze)

       AGE      CLASS        SEX
0.08190308 0.14815233 0.22788570

Same order for both …

3.5 Partial dependence scores

And how are these variables related to the prediction of the second level of the response variable, SURVIVED: yes?

(pd_s <- partial(  # make pd.c contain partial dependence scores
   object=cf_1,    # from this forest
   pred.var="SEX", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=d))       # these were the training data

     SEX     yhat
1 female 0.744173
2   male 0.218873

(pd_a <- partial(  # make pd.c contain partial dependence scores
   object=cf_1,    # from this forest
   pred.var="AGE", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=d))       # these were the training data

    AGE      yhat
1 adult 0.3157287
2 child 0.4831514

(pd_c <- partial(    # make pd.c contain partial dependence scores
   object=cf_1,      # from this forest
   pred.var="CLASS", # for this predictor
   which.class=2,    # for the 2nd level of the response
   prob=TRUE,        # return results on the probability scale
   train=d))         # these were the training data

  CLASS      yhat
1   1st 0.4619725
2   2nd 0.3009891
3   3rd 0.2409572
4  crew 0.3548243

This is certainly looking like women-and-children first: We’re predicting SURVIVED: yes and

for SEX, the score for female is much much higher than that for male;
for AGE, the score for child is much higher than that for adult;
for CLASS, the scores go down as you get to lower booking classes, but crew members score between 1st and 2nd class.

I’m not sure this rises to a level of complexity that requires visualization, but SFLWR³ provides some plots for this. However, this might all be premature, as we will see now …

4 Excursus on importance scores and interactions

Trees and especially forests have become increasingly frequent in linguistic applications in part because this family of approaches is said to have several attractive/advantageous features. These include the perceptions (!) that

tree-based approaches are often seen as a good ‘plan B’, i.e. a methodological fall-back position when regression modeling is seen as too difficult or (near) impossible for a certain data set;
tree-based methods outperform regression modeling in many – most? – respects (and I’m hoping people mean that like I did above, namely as ‘outperform in terms of prediction’)
the visualization of at least trees is more intuitively interpretable than the summary/coefficients table usually reported for regression models;
trees are particularly good, or even better than regression modeling, at exploring/revealing interactions and at handling non-linearities;
fitting trees/forests seems ‘so easy’: Even for data that would be very hard to model with statistical-modeling tools such as regression methods and would, therefore, require much preparation, exploration, diagnostics, and (cross-)validation, machine-learning tools like trees seem to take one or two lines of code and then you have a random forest with a good prediction accuracy (without diagnostics, validation, etc.), and easy-to-obtain variable importance scores.

However, the situation is not quite as simple and clear-cut. First, the ease of interpretation of trees is often overstated, as can be seen when you compare the following two plots, which represent the result of a tree (left panel) and of a regression (right panel) addressing the question whether the fuel efficiency of a car (mpg) is correlated with its weight (wt) (in the mtcars data set):

Figure 9: A slope in a regression tree vs. in a regression

Also, note that the regression model makes more fine-grained predictions (and heterogeneity of predictions is in part a reflection of discriminatory/predictive power):

Figure 10: Predictions from a regression and the tree

Finally, with regard to ease of interpretability, some trees can be nightmarishly complex. You saw the conditional inference tree on the genitive data above that was extremely complex (and wasn’t even significantly better than a very small tree from tree::tree) and here is Figure 1 of Bernaisch, Gries, & Mukherjee (2014:17), which is also not exactly fun:

Figure 11: Figure 1 of Bernaisch, Gries, & Mukherjee (2014:17)

Another, much more important issue is concerned with the notion of interactions, in particular their identifiability and their relation to variable importance scores. Let’s first consider how an interaction would be represented in a tree. Consider a scenario (from SFLWR³: Section 5.2.4.2) where you have

a numeric response LENGTH;
two binary predictors:
- CLAUSE: main vs. subord;
- GRAMREL: obj vs. subj.

Since interactions are technically defined as “something doesn’t do the same everywhere/always” (SFLWR³: Section 5.2.4.2), a possible interaction might be that

the difference between object and subject lengths is positive when CLAUSE is main but
the difference between object and subject lengths is negative when CLAUSE is subord.

In a conditional inference and a regression tree, this might look like this:

Figure 12: A 2-by-2 interaction in regression trees

Figure 13: A 2-by-2 interaction in regression trees

But there are cases where trees are not as great at interactions as one might think. As the depth/complexity of a tree grows, the interactions it can reveal become more and more intricate or specialized; it’s not uncommon to look at a more complex tree and see 4- or 5-way interactions, which are (i) extremely hard to describe and interpret and (ii) probably often not really explainable (let alone motivatable a priori). But sometimes it’s even worse: Sometimes, tree-based approaches don’t even see what they are supposedly so good at and/or provide quite importance results that are ambiguous and regularly misinterpreted.

Consider the following tiny data set of three binary predictors and one binary response (the s in the variable names stands for ‘small’):

P1s <- factor(rep(c("b","a","b","a"), c(1,7,9,3)) ) # predictor 1 small
P2s <- factor(rep(c("e","f","e"), c(6,10,4)))       # predictor 2 small
P3s <- factor(rep(c("m","n","m","n"), c(6,4,6,4)))  # predictor 3 small
Rs <- factor(rep(c("x","y"), c(10, 10)))            # response small
Ds <- data.frame(P1s, P2s, P3s, Rs)

The relation between the predictors and the response is such that P1s is most associated with the response Rs, P2s is less associated with Rs, and P3s is not associated with Rs at all:

table(P1s, Rs) # accuracy=0.7

   Rs
P1s x y
  a 7 3
  b 3 7

table(P2s, Rs) # accuracy=0.6

   Rs
P2s x y
  e 6 4
  f 4 6

table(P3s, Rs) # accuracy=0.5

   Rs
P3s x y
  m 6 6
  n 4 4

However, there’s an interaction effect here such that P2s:P3s predicts Rs perfectly, i.e. with an accuracy of 1 (100%):

ftable(      # make a flat contingency table
   P2s, P3s, # these things in the rows
   Rs)       # this in the columns

        Rs x y
P2s P3s
e   m      6 0
    n      0 4
f   m      0 6
    n      4 0

Why is this an interaction? Because, for instance, P3s: m doesn’t ‘do the same everywhere’:

it ‘promotes’ R2: x when P2s is e, but
it ‘promotes’ R2: y when P2s is f.

This data set exemplifies something that is at least relatable to the so-called XOR problem, “a situation where two variables show no main effect [not true of P2s here, though, which has a ‘main effect’] but a perfect interaction. In this case, because of the lack of a marginally detectable main effect, none of the variables may be selected in the first split of a classification tree, and the interaction may never be discovered” (Strobl et al. 2009:341). And indeed, tree::tree doesn’t see the perfectly predictive interaction – instead, it returns a tree with an accuracy of 0.75 that uses the interaction P1s:P3s (Note that rpart::rpart doesn’t fare better, neither does party(kit)::ctree, which makes no split at all.)

plot(tree_tree <- tree::tree(Rs ~ P1s+P2s+P3s)); grid() # plot the classification tree
axis(2); mtext("Deviance", 2, 3)    # add a useful y-axis
text(tree_tree, pretty=0, all=TRUE) # add all full labels to it
text(                  # plot text
   x=2.5, y=21,        # at these coordinates
   paste0("accuracy=", # this string, followed by the accuracy
           mean(predict(tree_tree, type="class")==Rs)))

Figure 14: A tree on the (small) example data

But then, maybe this is because the data set is so small? Let’s increase it by one order of magnitude (the l in the variable names stands for ‘larger’) and try again :

P1l <- rep(P1s, 10) # predictor 1 larger
P2l <- rep(P2s, 10) # predictor 2 larger
P3l <- rep(P3s, 10) # predictor 3 larger
Rl <- rep(Rs, 10)   # response larger
Dl <- data.frame(P1l, P2l, P3l, Rl)

plot(tree_tree <- tree::tree(Rl ~ P1l+P2l+P3l)); grid() # plot the classification tree
axis(2); mtext("Deviance", 2, 3)    # add a useful y-axis
text(tree_tree, pretty=0, all=TRUE) # add all full labels to it
text(                  # plot text
   x=5.5, y=125,       # at these coordinates
   paste0("accuracy=", # this string, followed by the accuracy
           mean(predict(tree_tree, type="class")==Rl)))

Figure 15: A tree on the (large) example data

Now we get perfect accuracy (also from rpart::rpart or party(kit)::ctree), but the result is still not great in terms of parsimony: We’re not easily shown that P1l is not needed at all – we’re essentially presented with a 3-way interaction when a 2-way interaction is all that’s needed. The reason is that “the split selection process in regular classification trees is only locally optimal in each node: A variable and cutpoint are chosen with respect to the impurity reduction they can achieve in a given node defined by all previous splits, but regardless of all splits yet to come” (Strobl et al. 2009:333, my emphasis). In other words, in all cases so far, the algorithm sees that P1 has the highest monofactorial association and runs with it, but it never ‘backtracks’ to see whether starting with a split on the ‘monofactorially inferior’ P2 or P3 would ultimately lead to an overall superior outcome. But this is of course where random forests should be able to help, because, in some of the trees of the forest,

the random sampling of the data might have (i) weakened the association of P1l to Rl and (ii) strengthened the association of P2l or P3l to Rl, maybe giving these monofactorially weaker predictors ‘a leg up’ to be chosen on the first split;
P1l will not be available for the first split, so that P2l will be chosen instead, and if then P3l is available for the second split, the tree will get perfect accuracy (and such results will augment these variables’ importance scores).

Indeed, this is what happens:

set.seed(sum(utf8ToInt("Knoblauchwürste"))); rf_l <- randomForest::randomForest(
   Rl ~ P1l+P2l+P3l, mtry=2)
mean(predict(rf_l)==Rl) # prediction accuracy
## [1] 1

Prediction accuracy goes up to 1, which is great, but look at these results:

# variable importance
randomForest::importance(rf_l)

    MeanDecreaseGini
P1l         13.13008
P2l         19.57731
P3l         50.19747

# compute partial dependence scores
(pd_1 <- partial(  # make pd_1 contain partial dependence scores
   object=rf_l,    # from this forest
   pred.var="P1l", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=Dl))      # these were the training data

  P1l   yhat
1   a 0.2706
2   b 0.5476

(pd_2 <- partial(  # make pd_2 contain partial dependence scores
   object=rf_l,    # from this forest
   pred.var="P2l", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=Dl))      # these were the training data

  P2l   yhat
1   e 0.4109
2   f 0.4598

(pd_3 <- partial(  # make pd_3 contain partial dependence scores
   object=rf_l,    # from this forest
   pred.var="P3l", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=Dl))      # these were the training data

  P3l   yhat
1   m 0.4442
2   n 0.4904

In a sense, the variable importance scores and partial dependence scores are still not great: The fact that P2l:P3l is what really matters is still nowhere to be seen and even though the variable importance scores for P3l are really high, its partial dependence scores are really low. And all that counterintuitive mess in a data set as small and actually simple as this! Does a conditional inference forest perform better?

set.seed(sum(utf8ToInt("Knoblauchwürste"))); cf_l <- cforest(
   Rl ~ P1l+P2l+P3l, mtry=2)
mean(predict(cf_l)==Rl) # prediction accuracy

[1] 1

varimp(cf_l)                   # variable importance

     P1l      P2l      P3l
2.351041 3.753965 4.853231

varimp(cf_l, conditional=TRUE) # variable importance (conditional)

     P1l      P2l      P3l
0.161569 3.521965 4.788406

(pd_1 <- partial(  # make pd_1 contain partial dependence scores
   object=cf_l,    # from this forest
   pred.var="P1l", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=Dl))      # these were the training data

  P1l      yhat
1   a 0.2689748
2   b 0.5722766

(pd_2 <- partial(  # make pd_2 contain partial dependence scores
   object=cf_l,    # from this forest
   pred.var="P2l", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=Dl))      # these were the training data

  P2l      yhat
1   e 0.4705819
2   f 0.4342012

(pd_3 <- partial(  # make pd_3 contain partial dependence scores
   object=cf_l,    # from this forest
   pred.var="P3l", # for this predictor
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=Dl))      # these were the training data

  P3l      yhat
1   m 0.4544925
2   n 0.5021139

It does a bit better when the conditional variable importance scores are used because those are (nicely) small for P1l – but we still don’t get the one piece of information everyone would want, that P2l:P3l ‘does it all’. And we again have the seemingly counterintuitive result that the variable we know does nothing monofactorially – P3l – has low partial dependence scores (ok) but the by far highest variable importance score (?). So you need to understand the following (which is something I figure many users of forests do actually not know, judging from their discussions of forests – I dare you find a discussion of forest results in linguistics that makes the following point clear and follows through on its implications): Especially if we’re used to regression modeling, what we’re getting shown from the variable importance score looks like a main effect, like ‘P3l (on its own) does a lot!’, but what the variable importance score really means is ‘P3l is (somehow!) important for making good predictions’, where somehow means P3l might

theoretically be a strong main effect (which, here, we know it is not from the simple cross-tabulation at the beginning);
be involved in one or more strong two-way interactions with P1l and/or P2l, i.e. the ‘real’ effect of P3l would be, or would follow from, what in regression modeling we’d write as P1l:P3l and/or P2l:P3l;
be involved in a strong three-way interactions with P1l and P2l, i.e. the ‘real’ effect P3l would be in, or would follow from, P1l:P2l:P3l.

Thus, a high variable importance score for P3l can mean that any one (or more!) of the following things are important: P3l, P1l:P3l, P2l:P3l, or P1l:P2lP3l, but you’re not told which of these it is. But this fact about variable importance scores is something that most papers using random forests that I’ve seen seem to have not fully understood. Most papers I’ve seen arrive at a high variable importance score of some predictor like P3l and then interpret the effect of that ‘important’ variable in a monofactorial main-effect kind of way. Many studies that used random forests will most likely have made a big monofactorial deal out of the predictors for which they obtained a high variable importance score without ever realizing that the real reason for the predictor was its participation in an interaction, and you shouldn’t trust the discussion if the authors don’t make it clear how they followed up on the variable importance score(s).

So how can we address this? As I discuss in Gries (2020 or in SFLWR³: Section 7:3), we can address this by following

Forina et al. (2009), who “augment […] the prediction data with addition of complex combinations of the data” (Kuhn & Johnson 2013:48, also see p. 57f.);
Strobl et al. (2009:341), who state that “an interaction effect of two variables can also be explicitly added as a potential predictor variable”;
Baayen (2011:305f.), whose application of Naïve Discriminative Learning includes “pairs of features” as “the functional equivalent of interactions in a regression model”.

That means, we force the tree-based approach to consider the combined effect of P2l and P3l by explicitly including in the formula the combination/interaction of these two predictors or, more generally, all predictors in whose interactions one might be interested in. For this data set, it just means we create three new predictors:

P1lxP2l <- P1l:P2l; P1lxP3l <- P1l:P3l; P2lxP3l <- P2l:P3l
Dl <- cbind(Dl, P1lxP2l, P1lxP3l, P2lxP3l)

And this solves everything for the trees (perfect accuracy and they only use P2lxP3l) …

(tt_l <- tree::tree(Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l))

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 200 277.3 x ( 0.5 0.5 )
  2) P2lxP3l: e:m,f:n 100   0.0 x ( 1.0 0.0 ) *
  3) P2lxP3l: e:n,f:m 100   0.0 y ( 0.0 1.0 ) *

(ct_l <- ctree(Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l))


Model formula:
Rl ~ P1l + P2l + P3l + P1lxP2l + P1lxP3l + P2lxP3l

Fitted party:
[1] root
|   [2] P2lxP3l in e:m, f:n: x (n = 100, err = 0.0%)
|   [3] P2lxP3l in e:n, f:m: y (n = 100, err = 0.0%)

Number of inner nodes:    1
Number of terminal nodes: 2

… and for the random forest …

set.seed(sum(utf8ToInt("Hanuta"))); rf_l <- randomForest::randomForest(
   Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l, mtry=2)
mean(predict(rf_l)==Rl) # accuracy

[1] 1

   randomForest::importance(rf_l) # variable importance

        MeanDecreaseGini
P1l             2.942730
P2l             4.237394
P3l             5.871964
P1lxP2l        11.071039
P1lxP3l        17.340106
P2lxP3l        54.049938

partial( # show partial dependence scores
   object=rf_l,        # from this forest
   pred.var="P2lxP3l", # for this predictor
   which.class=2,      # for the 2nd level of the response
   prob=TRUE,          # return results on the probability scale
   train=Dl)           # these were the training data

  P2lxP3l   yhat
1     e:m 0.1864
2     e:n 0.7658
3     f:m 0.6921
4     f:n 0.2291

… and for the conditional inference forest:

set.seed(sum(utf8ToInt("Lachsschinken"))); cf_l <- cforest(
   Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l, mtry=2)
   mean(predict(cf_l)==Rl) # accuracy

[1] 1

   varimp(cf_l) # variable importance

     P1l      P2l      P3l  P1lxP2l  P1lxP3l  P2lxP3l
2.589657 3.092083 3.028634 3.378003 3.563074 6.132993

partial( # show partial dependence scores
   object=cf_l,        # from this forest
   pred.var="P2lxP3l", # for this predictor
   which.class=2,      # for the 2nd level of the response
   prob=TRUE,          # return results on the probability scale
   train=Dl)           # these were the training data

  P2lxP3l      yhat
1     e:m 0.2285684
2     e:n 0.7717199
3     f:m 0.7022035
4     f:n 0.2036099

Now (i) every tree/forest returns perfect accuracy, (ii) the forests return the by far highest variable importance scores for P2lxP3l, and (iii) the partial dependence scores of that predictor reveal the interaction pattern we knew is in the data … finally! And if you also add the three way interaction P1lxP2lxP3l to the formula, then tree, ctree, and cforest all still recover that P2lxP3l is most important – only randomForest assigns a variable importance score to P1lxP2lxP3l that’s slightly higher than that of P2lxP3l).

Thus, bottom line, trees and especially more robust and powerful forests can be good at detecting interactions, but it’s nowhere near as simple as is often thought. One should always (i) compute a forest, its accuracy, precision, and recall, but also (ii) variable importance scores and (iii) partial dependence scores. High variable importance scores for a predictor can mean many things: don’t take them at face value and let them fool you into automatically assuming they reflect an important ‘main effect’ and then you plot/interpret that and you’re done – a high variable importance score only means ‘that predictor does something, somehow’ and you need to do follow-up analyses to find out what exactly that is. And studying that can be done with interaction predictors (or in other ways, see Gries 2020). In fact, Strobl et al. (2009:339) go so far as to advise “In general, however, the complex random forest model, involving high-order interactions and nonlinearity, should be compared to a simpler model (e.g., a linear or logistic regression model including only low-order interactions) whenever possible to decide whether the simpler, interpretable model would be equally adequate”. Did you catch that?? Random forests are the more complex models, because each tree of a forest can easily involve 3-, 4-, 5-, … way interactions that people would never dream of fitting in a regression model but that they rely unquestioningly in a forest, no questions asked. Don’t let the apparent simplicity of the R code for a forest trick you into believing that forests are the simpler method; part of why they are doing so much better in predictions than regressions is that they can involve very complex partitioning of the data.

Given this underappreciated (by many) complexity of random forests, Strobl et al. (2009:341) conclude that

random forests can at least serve as a benchmark predictor: If a linear or other parametric model with a limited number and degree of interaction terms can reach the (cross-validated or test sample) prediction accuracy of the more complex model, the extra complexity may be uncalled for and the simpler, interpretable model should be given preference. If, however, the prediction accuracy cannot be reached with the simpler model, and, for example, the high importance of a variable in a random forest is not reflected by its respective parameters in the simpler model, relevant nonlinear or interaction effects may be missing in the simpler model, and it may not be suited to grasping the complexity of the underlying process.

In other words, ‘consider doing both!’ – a regression and a tree-based method.

5 Women and children first? Part 2

Let’s revisit the Titanic then because, after having seen notable variable importance scores for SEX and AGE, the question becomes, what do these values for SEX and AGE reflect: main effects of, say, SEX (‘women first’!) and AGE (‘children first’!) or interactions that SEX and AGE participate in? Let’s find out and let’s include all pairwise interactions. We first add those interactions to the data frame x:

# add interaction predictors for the two main predictors of interest
d$CLASSxSEX <- droplevels(d$CLASS:d$SEX) # using droplevels to eliminate
d$CLASSxAGE <- droplevels(d$CLASS:d$AGE) # factor level combinations that
d$SEXxAGE   <- droplevels(d$SEX:d$AGE)   # aren't attested

5.1 Fitting a conditional inference forest

So now we fit the next forest and include those:

set.seed(sum(utf8ToInt("Kackvogel"))); cf_2 <- cforest(
   SURVIVED ~
   CLASS + SEX + AGE +              # w/ these predictors
   CLASSxSEX + CLASSxAGE + SEXxAGE, # w/ these interaction predictors
   data=d,                          # the variables are in d
   ntree=2500,                      # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE),      # ... sample w/ replacement
   mtry=3,                          # how many variables are eligible at each split
   cores=12) # how many threads to use, adjust as needed (set to 1 on Windoze)

5.2 Evaluating the forest’s performance

So, how well is this doing in terms of prediction? With regard to prediction, there’s probably not going to be much of a change because there are really only very few combinations of variable levels, which the previous forest probably dealt with ok.

d$PRED_PP_y <- predict( # make d$PRED_PP_y the result of predicting
   cf_2,                # from cf_2
   type="prob")[,"yes"] # predicted probabilities of "yes"
d$PRED_CAT <- predict(cf_2) # categorical predictions
(c_m <- table(OBS=d$SURVIVED, # cross-tabulate observed survival outcomes in the rows
   PREDS=d$PRED_CAT))         # predicted survival outcomes in the columns

     PREDS
OBS     no  yes
  no  1470   20
  yes  441  270

c( # evaluate the confusion matrix
   "Pred. acc."         =mean(d$SURVIVED==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[     ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",]),
   "Prec. for no" =c_m[ "no","no"]  / sum(c_m[     , "no"]),
   "Rec. for no"  =c_m[ "no","no"]  / sum(c_m[ "no",]))

   Pred. acc. Prec. for yes  Rec. for yes  Prec. for no   Rec. for no
    0.7905498     0.9310345     0.3797468     0.7692308     0.9865772

c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$SURVIVED, d$PRED_PP_y))

Cohen's kappa             C
    0.4334102     0.7682940

Indeed the results are the same, but the main focus of including the interaction predictors is not to boost prediction, but to try and see whether our explanation is improved so we move on to variable importance and then partial dependence:

5.3 Variable importance(s)

How do the variable importance scores change now that the interaction predictors are included?

sort(varimp(cf_2)) # sort the variable importance scores of cf_2

       AGE      CLASS  CLASSxAGE        SEX    SEXxAGE  CLASSxSEX
0.02506120 0.05680789 0.11606316 0.17746716 0.21202256 0.23031047

sort(varimp(cf_2,    # sort the variable importance scores of cf_2, but
   conditional=TRUE, # the conditional version
   cores=12))        # how many threads to use, adjust as needed (set to 1 on Windoze)

        SEX         AGE       CLASS   CLASSxSEX   CLASSxAGE     SEXxAGE
0.001559753 0.001857089 0.003531200 0.009586468 0.015919016 0.023494299

The interactions come out on top.

5.4 Partial dependence scores

And now let’s see what the interactions are doing:

(pd_cs <- partial(       # make pd.cs contain partial dependence scores
   object=cf_2,          # from this forest
   pred.var="CLASSxSEX", # for this predictor
   which.class=2,        # for the 2nd level of the response
   prob=TRUE,            # return results on the probability scale
   train=d))             # these were the training data

    CLASSxSEX      yhat
1  1st:female 0.6310151
2    1st:male 0.3395256
3  2nd:female 0.6228049
4    2nd:male 0.2826053
5  3rd:female 0.3309056
6    3rd:male 0.2669302
7 crew:female 0.6212056
8   crew:male 0.2863500

(pd_ca <- partial(       # make pd.ca contain partial dependence scores
   object=cf_2,          # from this forest
   pred.var="CLASSxAGE", # for this predictor
   which.class=2,        # for the 2nd level of the response
   prob=TRUE,            # return results on the probability scale
   train=d))             # these were the training data

   CLASSxAGE      yhat
1  1st:adult 0.3793968
2  1st:child 0.4704584
3  2nd:adult 0.2824445
4  2nd:child 0.5022200
5  3rd:adult 0.2987593
6  3rd:child 0.3397364
7 crew:adult 0.3309079

(pd_sa <- partial(     # make pd.sa contain partial dependence scores
   object=cf_2,        # from this forest
   pred.var="SEXxAGE", # for this predictor
   which.class=2,      # for the 2nd level of the response
   prob=TRUE,          # return results on the probability scale
   train=d))           # these were the training data

       SEXxAGE      yhat
1 female:adult 0.4845489
2 female:child 0.4777398
3   male:adult 0.2802215
4   male:child 0.4297447

This is easier to interpret when broken up a bit differently and maybe when visualized; here’s the example of CLASS:SEX:

        female      male
1st  0.6310151 0.3395256
2nd  0.6228049 0.2826053
3rd  0.3309056 0.2669302
crew 0.6212056 0.2863500

For CLASS:SEX, there is obviously some sort of interaction effect: We knew from pd_s that the chances of women surviving is generally much higher than that of men, but we now see that this effect doesn’t hold everywhere in the same way. In other words, being female doesn’t help you everywhere/always equally (see 3rd class).

Here are the corresponding plots for the other two interactions:

For CLASS:AGE, there is some sort of interaction effect. We knew from pd_a that the chances of children to survive were generally much higher than that of adults, but we now see that this effect is restricted to first and second class and much less strong in 3rd class. Finally SEX:AGE: We knew from pd_a that the chances of children to survive were generally much higher than that of adults and we knew from pd_s that the chances of females to survive were much higher than that of males. If both these effects were perfectly additive, we’d see that female children have the highest scores for survival, but they don’t: female adults do, but among females the difference between children and adults is actually very minor. However, among males, the difference between children and adults is considerably bigger.

I hope this makes clear that, as discussed earlier, you do not want to take variable importance scores and partial dependence scores for predictors as indications of them being important just as main effects. Without the more fine-grained analysis here, our understanding of the effects would be much less complete; check the exercises for SFLWR³: Chapter 7, for a bit (including plots for the interaction) on this example.

6 Revisiting the subject realization data

Let us now review the subject realization data from the regression modeling part:

rm(list=ls(all=TRUE)); invisible(gc())
invisible(sapply(dir("_helpers", full.names=TRUE), source))
summary(d <- read.delim(
   file="_input/subjreal.csv",
   stringsAsFactors=TRUE))

      CASE        SPKR          SPEAKERID     SUBJREAL    CONTRAST
 Min.   :    1   njs:7371   26-JE_NNS:  821   no  :5016   no  :6393
 1st Qu.: 3814   nns:7394   19-JC_NNS:  769   yes :1649   yes : 271
 Median : 7634              18-JK_NJS:  760   NA's:8100   NA's:8101
 Mean   : 7630              7-JE_NJS :  731
 3rd Qu.:11443              24-JE_NNS:  710
 Max.   :15265              1-JC_NJS :  705
                            (Other)  :10269
   GIVENNESS
 Min.   : 0.000
 1st Qu.: 8.000
 Median :10.000
 Mean   : 7.875
 3rd Qu.:10.000
 Max.   :10.000
 NA's   :8990

d <- droplevels(d[complete.cases(d),])

Let’s fit a conditional inference forest …

set.seed(sum(utf8ToInt("Kackvogel"))); cf_1 <- cforest(
   SUBJREAL ~                  # this response variable ~ 
   CONTRAST + GIVENNESS,       # w/ these predictors
   data=d,                     # the variables are in d
   ntree=2500,                 # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE), # ... sample w/ replacement
   mtry=3,                     # how many variables are eligible at each split
   cores=12) # how many threads to use, adjust as needed (set to 1 on Windoze)

… and let’s get the predictions:

# much faster way to compute the predictions:
qwe <- predict(cf_1, type="prob") # all pred. probs. from cf_1
d$PRED_PP_y <- qwe[,"yes"] # predicted probabilities of "yes"
d$PRED_CAT <- levels(d$SUBJREAL)[apply(qwe, 1, which.max)] # categorical predictions
rm(qwe)
(c_m <- table(OBS=d$SUBJREAL, # cross-tabulate obs. subj realizations in the rows
   PREDS=d$PRED_CAT))         # predicted subject realizations in the columns

     PREDS
OBS     no  yes
  no  3932  261
  yes  636  946

c( # evaluate the confusion matrix
   "Pred. acc."   =mean(d$SUBJREAL==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[     ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",]),
   "Prec. for no" =c_m[ "no","no"]  / sum(c_m[     , "no"]),
   "Rec. for no"  =c_m[ "no","no"]  / sum(c_m[ "no",]))

   Pred. acc. Prec. for yes  Rec. for yes  Prec. for no   Rec. for no
    0.8446753     0.7837614     0.5979772     0.8607706     0.9377534

c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$SUBJREAL, d$PRED_PP_y))

Cohen's kappa             C
    0.5784201     0.7977769

Now, we have a pretty good understanding of this data set from the regression model. We know there’s a pretty strong interaction there – what do the variable importance scores say?

sort(varimp(cf_1))   # sort the variable importance scores of cf_1

  CONTRAST  GIVENNESS
0.06948541 0.24813726

sort(varimp(cf_1,    # sort the variable importance scores of cf_1, but
   conditional=TRUE, # the conditional version
   cores=12))        # how many threads to use, adjust as needed (set to 1 on Windoze)

  CONTRAST  GIVENNESS
0.06979104 0.24859358

Obviously, this does again not address the issue of the interaction … But here’s a way to now deal with interactions of a numeric and categorical predictor:

pd_cg <- partial(       # make pd.cs contain partial dependence scores
   object=cf_1,         # from this forest
   pred.var=c("GIVENNESS", "CONTRAST"), # for these predictors <-----------------
   which.class=2,       # for the 2nd level of the response
   prob=TRUE,           # return results on the probability scale
   train=d)             # these were the training data

Let’s change the output format to something slightly easier to look at and also plot this to see what it means:

(qwe <- tapply(pd_cg$yhat, list(pd_cg$GIVENNESS, pd_cg$CONTRAST), mean))

          no       yes
0  0.7728195 0.8412396
1  0.6590543 0.8955317
2  0.4765675 0.8862407
3  0.6867812 0.9096001
4  0.3341906 0.7078323
5  0.4885044 0.7313850
6  0.3864313 0.6775151
7  0.2558081 0.7269316
8  0.2197381 0.6677295
9  0.1855543 0.9488697
10 0.1187600 0.8370951

Figure 19: The interaction of CONTRAST:GIVENNESS

Now we’re talking …

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  compiler
[8] methods   base

other attached packages:
[1] partykit_1.2-24 mvtnorm_1.3-3   libcoin_1.0-10  pdp_0.8.2
[5] STGmisc_1.02    Rcpp_1.1.0      magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] randomForest_4.7-1.2 Matrix_1.7-3         gtable_0.3.6
 [4] jsonlite_2.0.0       dplyr_1.1.4          rpart_4.1.24
 [7] tidyselect_1.2.1     parallel_4.5.1       splines_4.5.1
[10] scales_1.4.0         yaml_2.3.10          fastmap_1.2.0
[13] lattice_0.22-7       ggplot2_3.5.2        R6_2.6.1
[16] generics_0.1.4       Formula_1.2-5        knitr_1.50
[19] iterators_1.0.14     htmlwidgets_1.6.4    MASS_7.3-65
[22] inum_1.0-5           tibble_3.3.0         tree_1.0-44
[25] pillar_1.11.0        RColorBrewer_1.1-3   rlang_1.1.6
[28] xfun_0.52            cli_3.6.5            digest_0.6.37
[31] foreach_1.5.2        rstudioapi_0.17.1    lifecycle_1.0.4
[34] vctrs_0.6.5          evaluate_1.0.4       glue_1.8.0
[37] farver_2.1.2         codetools_0.2-20     survival_3.8-3
[40] rmarkdown_2.29       tools_4.5.1          pkgconfig_2.0.3
[43] htmltools_0.5.8.1

Footnotes

By the way, here’s an interesting little nugget that I’ve never seen or heard acknowledged in linguistics: This is what the documentation of cforest for both party (at least till version 1.3-5) and partykit (at least till version 1.2-10) says about itself (my emphasis): “Ensembles of conditional inference trees haven’t yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental” … Also note that the use of parallelization leads to random and not replicable results because, I’m suspecting, the random number seed is not being passed on to the different cores/threads correctly, but, while fixes to that are available, the package maintainer maintains this is beyond his control.↩︎