We’ll start our discussion of trees by revisiting an example that we discussed before in the context of glms: the choice of a genitive construction with the data in _input/genitives.csv and information about the variables/columns in _input/genitives.r. As before, our response variable is GENITIVE:
rm(list=ls(all=TRUE)); invisible(gc())
set.seed(sum(utf8ToInt("Gummibärchen")))
library(pdp); library(tree)
invisible(sapply(dir("_helpers", full.names=TRUE), source))
summary(d <- read.delim(         # summarize d, the result of loading
   file="_input/genitives.csv",  # this file
   stringsAsFactors=TRUE))       # change categorical variables into factors
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH)-log2(d$PUM_LENGTH) # log length difference
d <- d[,c(1:4,10,6:9,5)]                                   # reorder the columns
CASE GENITIVE SPEAKER MODALITY POR_LENGTH
Min. : 2 of:2720 nns:2666 spoken :1685 Min. : 1.00
1st Qu.:1006 s : 880 ns : 934 written:1915 1st Qu.: 8.00
Median :2018 Median : 11.00
Mean :2012 Mean : 14.58
3rd Qu.:3017 3rd Qu.: 17.00
Max. :4040 Max. :204.00
PUM_LENGTH POR_ANIMACY POR_FINAL_SIB POR_DEF
Min. : 2.00 animate : 920 absent :2721 definite :2349
1st Qu.: 6.00 collective: 607 present: 879 indefinite:1251
Median : 9.00 inanimate :1671
Mean : 10.35 locative : 243
3rd Qu.: 13.00 temporal : 159
Max. :109.00
Let’s re-compute the baseline for and deviance of our response variable; you should revisit the first fixef part to see why I could compute the deviance here as I do:
(baseline <- max(       # make baseline the highest
   prop.table(          # proportion in the
   table(d$GENITIVE)))) # frequency table of the response
d$GENITIVE %>% table %>% prop.table %>%
   rep(table(d$GENITIVE)) %>% log %>% sum %>%
   "*"(-2) # deviance
# but you can also fit a 'null tree':
(cart_0 <- tree(GENITIVE ~ 1, data=d)) %>% deviance
[1] 0.7555556
[1] 4004.273
[1] 4004.273
Let’s begin with an extremely simple, monofactorial scenario, in which we’ll look at the question of whether the genitive choices (GENITIVE) are correlated with the animacy of the possessor (POR_ANIMACY). We first do a simple descriptive cross-tabulation and then use the function tree::tree (Ripley 2019) to create a first classification tree, which we also immediately summarize:
addmargins(table(d$POR_ANIMACY, d$GENITIVE))
of s Sum
animate 370 550 920
collective 408 199 607
inanimate 1638 33 1671
locative 199 44 243
temporal 105 54 159
Sum 2720 880 3600
summary(cart_1 <- tree( # summarize a tree called cart_1
   GENITIVE ~           # a classification tree of GENITIVE ~
   POR_ANIMACY,         # POR_ANIMACY
   data=d))             # those variables are in d
Classification tree:
tree(formula = GENITIVE ~ POR_ANIMACY, data = d)
Number of terminal nodes: 3
Residual mean deviance: 0.7749 = 2787 / 3597
Misclassification error rate: 0.1944 = 700 / 3600
Let’s start at the bottom, where the results indicate that the tree makes 700 incorrect classifications and, thus, 3600-700=2900 correct ones, i.e. it has an accuracy of 0.8056. Next, the tree that the algorithm came up with has three terminal nodes – what does it look like?
plot(cart_1); grid()              # plot the classification tree
axis(2); mtext("Deviance", 2, 3)  # add a useful y-axis
text(cart_1,   # add labels to the tree
   pretty=4,   # abbreviate the labels to 4 characters
   all=TRUE)   # label all nodes & splits
Figure 1: The classification tree of cart_1
The way to read this is quite simple: we start at the top, the so-called root (node), and see a first split and, right below it, of. That of means ‘if you don’t do this split, you should predict of (because it’s the more frequent level of GENITIVE)’. The expression above the split should be read as a question or a conditional expression: Is POR_ANIMACY inanimate? If ‘yes’, you go to the left and down that branch to the first terminal node, which ‘predicts’ of; if ‘no’, you go to the right and down that branch, which now features no inanimate possessors anymore, to the second split, which asks ‘is POR_ANIMACY collective, locative, or temporal?’. If ‘yes’, you go to the left and again ‘predict’ of; if ‘no’, you are left with only animate possessors and ‘predict’ s.
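One quick way to check this reading is to ask the tree for its class prediction for each possessor animacy level; this is just a small sanity check I am adding here, not part of the original code:

nd <- data.frame(POR_ANIMACY=factor(levels(d$POR_ANIMACY),  # one row per animacy level
                                    levels=levels(d$POR_ANIMACY)))
data.frame(nd, PREDICTION=predict(cart_1, newdata=nd, type="class"))  # the tree's class predictions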
This information is also represented in the object cart_1 in tabular/text form:
cart_1$frame # doesn't get displayed in render, must be a bug
var n dev yval splits.cutleft splits.cutright yprob.of
1 POR_ANIMACY 3600 4004.2730 of :c :abde 0.75555556
2 <leaf> 1671 324.3722 of 0.98025135
3 POR_ANIMACY 1929 2645.4618 of :bde :a 0.56091239
6 <leaf> 1009 1222.9111 of 0.70564916
7 <leaf> 920 1239.9452 s 0.40217391
yprob.s
1 0.24444444
2 0.01974865
3 0.43908761
6 0.29435084
7 0.59782609
Let’s now extend this to something more realistic (by using all predictors) and also look at tree’s output a bit more comprehensively:
summary(cart_2 <- tree( # summarize an object called cart_2
   GENITIVE ~           # GENITIVE ~ all predictors
   LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
   MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d))             # those variables are in d
Classification tree:
tree(formula = GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
MODALITY + POR_FINAL_SIB + POR_DEF, data = d)
Variables actually used in tree construction:
[1] "POR_ANIMACY" "LEN_PORmPUM_LOG"
Number of terminal nodes: 5
Residual mean deviance: 0.7066 = 2540 / 3595
Misclassification error rate: 0.1625 = 585 / 3600
We get the same info as before, but now also a first (coarse) indication of variable importance, namely a list of the variables that actually feature in the tree. From the absence of several variables from that list, we can infer that most of them are not actually relevant to the tree. Let’s look at this in more detail:
cart_2$frame # doesn't get displayed in render, must be a bug
var n dev yval splits.cutleft splits.cutright
1 POR_ANIMACY 3600 4004.2730 of :c :abde
2 <leaf> 1671 324.3722 of
3 POR_ANIMACY 1929 2645.4618 of :bde :a
6 LEN_PORmPUM_LOG 1009 1222.9111 of <-0.522544 >-0.522544
12 <leaf> 313 427.4180 s
13 <leaf> 696 633.5778 of
7 LEN_PORmPUM_LOG 920 1239.9452 s <0.821928 >0.821928
14 <leaf> 618 752.6407 s
15 <leaf> 302 402.2872 of
yprob.of yprob.s
1 0.75555556 0.24444444
2 0.98025135 0.01974865
3 0.56091239 0.43908761
6 0.70564916 0.29435084
12 0.42811502 0.57188498
13 0.83045977 0.16954023
7 0.40217391 0.59782609
14 0.29773463 0.70226537
15 0.61589404 0.38410596
Ok, so now looking at a plot might be more helpful … maybe:
plot(cart_2); grid()              # plot the classification tree
axis(2); mtext("Deviance", 2, 3)  # add a useful y-axis
text(cart_2,   # add labels to the tree
   pretty=4,   # abbreviate the labels to 4 characters
   all=TRUE)   # label all nodes & splits
Figure 2: The classification tree of cart_2
How do we interpret this?
of-genitives are predicted when
node 2: possessors are inanimate (pcorrect=0.98);
node 13: possessors are {collective, locative, temporal} and LEN_PORmPUM_LOG is > -0.522544 (i.e. the possessor is mostly longer than the possessum and usually not more than 3 units shorter than it) (pcorrect=0.83);
node 15: possessors are animate and LEN_PORmPUM_LOG is > 0.821928 (i.e. the possessor is at least 3 units longer than the possessum) (pcorrect=0.616);
s-genitives are predicted when
node 12: possessors are {collective, locative, temporal} and LEN_PORmPUM_LOG is < -0.522544 (i.e. the possessor is at least 2 units shorter than the possessum) (pcorrect=0.572);
node 14: possessors are animate and LEN_PORmPUM_LOG < 0.821928 (i.e. the possessor is mostly shorter than the possessum and usually not more than 4 units longer) (pcorrect=0.70227).
So, let’s compute predictions and look at the tree’s performance in a bit more detail than just the accuracy. TBH, I hardly ever see the C-score reported for trees and forests, but of course that’s an option; similarly, I never see Cohen’s kappa or an R-squared reported for trees or forests – I don’t understand why no one is doing that, given that esp. McFadden’s R2 is very easy to compute and makes a lot of conceptual sense (both in general and because, obviously, deviance can play a big role for trees):
d$PRED_PP_s <- predict(  # make d$PRED_PP_s the result of predicting
   cart_2                # from cart_2
   )[,"s"]               # predicted probabilities of "s"
   # predict.tree returns a matrix of class probabilities by default
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_s>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1]), # otherwise, predict "of"
   levels=levels(d$GENITIVE)) # make sure both levels are registered
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"=d$GENITIVE,    # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted genitives in the columns
PREDS
OBS of s
of 2402 318
s 267 613
c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.6584318 0.6965909 0.8999625 0.8830882
Acc. (overall)
0.8375000
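The three scores printed below (McFadden’s R-squared, Cohen’s kappa, and the C-score) come from a helper function that is not echoed here; the following is a sketch of how one could compute them from the deviances, the confusion matrix, and the predicted probabilities (object names such as dev_0 and p_obs are mine):

# a sketch of how the three scores below could be computed (not the helper actually used here):
dev_0 <- 4004.273          # the null deviance from above
dev_m <- deviance(cart_2)  # the tree's residual deviance
1 - dev_m/dev_0            # McFadden's R-squared
p_obs <- sum(diag(c_m))/sum(c_m)                    # observed accuracy
p_exp <- sum(rowSums(c_m)*colSums(c_m))/sum(c_m)^2  # chance accuracy
(p_obs-p_exp) / (1-p_exp)                           # Cohen's kappa
pp_s  <- d$PRED_PP_s[d$GENITIVE=="s"]   # pred. probs of "s" for the s cases
pp_of <- d$PRED_PP_s[d$GENITIVE=="of"]  # pred. probs of "s" for the of cases
mean(outer(pp_s, pp_of, ">") + 0.5*outer(pp_s, pp_of, "=="))  # C (concordance)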
McFadden's R-squared Cohen's kappa C
0.3656037 0.5685346 0.8793282
Pretty good actually!
detach(package:tree) # housekeeping
1.2 Conditional inference trees
An alternative tree-based approach that has gained some popularity in linguistic applications is that of conditional inference trees, in particular as implemented in the packages party (Hothorn et al. 2006) and partykit (Hothorn et al. 2015), which address concerns regarding the degree to which classification trees of the above kind can (i) overfit (which can be countered by pruning as alluded to above) and (ii) overestimate the importance of predictors with many different values (i.e. numeric predictors or categorical ones with many levels, see Strobl et al. 2009:342). This approach is conceptually similar to the kind of trees discussed above, but uses a p-value (adjusted for multiple tests) as a splitting criterion. Creating a conditional inference tree in R is (deceptively) simple, given everything we’ve discussed so far:
library(partykit)  # the newer reimplementation of the package party
ctree_1 <- ctree(  # create an object called ctree_1
   GENITIVE ~      # GENITIVE ~ all predictors
   LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
   MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d)         # those variables are in d
plot(ctree_1)
Figure 3: The conditional inference tree of ctree_1
Actually, let’s tweak the plot a bit to make it more legible:
Figure 4: The conditional inference tree of ctree_1
Ever so slightly more complex … If you look at this tree, though, why might its additional complexity not really be that great?
s-genitives are really rare on the right, where inanimates and locatives are – s-genitives are only ever preferred with locatives and possessors that are notably shorter than possessums;
s-genitives are mostly preferred on the left and mostly with animate possessors unless the length difference prefers of-genitives more;
does this even have a significantly higher degree of accuracy or Cohen’s κ?
Let’s now evaluate this more formally:
d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting
   ctree_1,             # from ctree_1
   type="prob")[,"s"]   # predicted probabilities of "s"
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_s>0.5,    # if the predicted prob. of "s" is >0.5 (note: >)
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1]), # otherwise, predict "of"
   levels=levels(d$GENITIVE)) # make sure both levels are registered
# you could also just use predict(ctree_1) for the categorical predictions
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"=d$GENITIVE,    # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted genitives in the columns
PREDS
OBS of s
of 2432 288
s 257 623
c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.6838639 0.7079545 0.9044254 0.8941176
Acc. (overall)
0.8486111
McFadden's R-squared Cohen's kappa C
0.4169551 0.5949829 0.9042862
Is the accuracy of this tree (0.8486111) significantly better than that of the much simpler tree earlier (0.8375)? No (and the confidence interval of each Cohen’s κ includes the other one):
binom.test(3055, 3600, 3015/3600)$p.value # not really
[1] 0.0707569
However, one major problem of both kinds of trees is “their instability to small changes in the learning data […] the entire tree structure could be altered if the first splitting variable [something to which we will return in detail below], or only the first cutpoint, was chosen differently because of a small change in the learning data. Because of this instability, the predictions of single trees show a high variability” (Strobl et al. 2009:330). Another problem of trees is that “the prediction of single trees is piecewise constant and thus may jump from one value to the next even for small changes of the predictor values” (Strobl et al. 2009:331). Both of these problems can be dealt with to some degree using so-called ensemble methods such as random forests, to which we turn now.
detach(package:partykit) # housekeeping
2 Ensembles of trees: random forests
Random forests (Breiman 2001a, Hastie, Tibshirani, & Friedman 2014: Ch. 15, James et al. 2021: Section 8.2) are an instance of what are called ensemble methods because “an ensemble or committee of classification trees is aggregated for prediction” (Strobl et al. 2009:325). In other words, a random forest consists of ntree different trees, but with two additional layers of randomness, namely randomness
on the level of the data points because random forests involve running a tree on ntree different data sets that were bootstrapped – sampled with replacement – from the original data (see SFLWR3, Section 5.2.1.5) and then amalgamating all those ntree trees;
on the level of predictors because at every split in every tree, only a randomly-selected subset of the predictors is eligible to be chosen, namely mtry predictors.
Random forests usually overfit much less than individual trees or even training/test validation approaches and can be straightforwardly evaluated given that random forest accuracies are prediction, not classification, accuracies. Why/how is that? If you bootstrap with replacement from your data, then, like I mentioned above, each of the ntree trees will
be based only on the cases that were sampled (with replacement) this time around, i.e. their training set, which, on average, will comprise 63.2% of all your different cases;
not ‘have seen’ the, on average, remaining 100-63.2=36.8% of the cases that were not sampled for the current tree, i.e. cases that the current tree was not trained on, which are referred to as the held-out cases or as the OOB (out-of-bag) cases (the short simulation below illustrates these two percentages).
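The 63.2%/36.8% figures are easy to verify with a tiny simulation (a sketch I am adding here, not part of the original code): sample 3600 row indices with replacement many times and see what proportion of rows gets picked at least once.

# a quick sketch verifying the approx. 63.2% / 36.8% split:
set.seed(1)  # arbitrary seed, just for a replicable illustration
in_bag <- replicate(1000, length(unique(sample(3600, replace=TRUE))))
mean(in_bag)/3600      # proportion of distinct in-bag cases, ~0.632
1 - mean(in_bag)/3600  # proportion of OOB cases, ~0.368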
Random forests are often excellent at prediction. But there is also at least one disadvantage: while they excel at prediction (often outperforming regression models and individual trees in that sense), interpreting them can be hard. There is no single model, no single tree from a single data set, that can be straightforwardly plotted – there’s a forest of, say, 500 different trees fit on 500 differently-sampled data sets, with thousands of different splits in different locations in trees with differently-sampled predictors available for splits. Precisely because of that, I hope to convince you that the practice of trying to summarize such a forest of ntree trees with a single, more appealing tree fitted to the whole data set is highly problematic; see James et al. (2021:343) for a comment regarding trees/forests to that effect:
“Recall that one of the advantages of decision trees is the attractive and easily interpreted diagram that results, such as the one displayed in Figure 8.1. However, when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure using a single tree […]”
With regard to the size or importance of an effect/predictor, the random forest implementations that seem to be most widely used in linguistics – randomForest::randomForest and party::cforest/partykit::cforest – offer the functionality of computing so-called variable importance scores, some version of which is regularly reported. Variable importance scores basically answer the question “how much (if at all) does a certain predictor contribute to predicting the response?”.
Warning: Note the exact formulation I’m using here: There’s an important reason why I did not simply say “how much does a certain predictor predict the response?” or “how much is a predictor correlated with the response?”.
With regard to the nature/direction of an effect, on the other hand, one can report partial dependence scores, which answer the question “how do values/levels of a certain predictor help predict the response?”, but, for no good reason I can discern, those are reported much less often. Let’s now discuss how this all works; thankfully, in terms of R code, it’s all pretty (but again deceptively) straightforward.
2.1 Forests of classification and regression trees
Let’s revisit the genitive data using the maybe best-known implementation in randomForest::randomForest (Liaw & Wiener 2002). The following shows that we use the familiar formula notation, but we also tweak a few hyperparameters and arguments (the book discusses more):
ntree: how many trees we want the forest to have; the default is 500 but since the data set is not that big, we can straightforwardly increase that number because “[i]n general, the higher the number of trees, the more reliable the prediction and the interpretability of the variable importance” (Strobl et al. 2009:343);
mtry: how many predictors we want to be available at each split; the default for a forest of
classification trees is the square root of the number of predictors in the formula
regression trees is 1/3 of the number of predictors;
note that “[w]hen predictor variables are highly correlated, […] a higher number of randomly preselected predictor variables is better suited to reflect conditional importance” (Strobl et al. 2009:343);
importance: whether we want predictor importance to be computed (the default is FALSE).
library(randomForest)
# set a replicable random number seed
set.seed(sum(utf8ToInt("Bacon bits"))); (rf_1 <- randomForest(
   GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
      MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d,            # the data are in d
   ntree=1500,        # how many trees to fit/how many data sets to ...
   mtry=3,            # how many variables are eligible at each split
   importance=TRUE))  # compute importance scores
Call:
randomForest(formula = GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF, data = d, ntree = 1500, mtry = 3, importance = TRUE)
Type of random forest: classification
Number of trees: 1500
No. of variables tried at each split: 3
OOB estimate of error rate: 14.28%
Confusion matrix:
of s class.error
of 2452 268 0.09852941
s 246 634 0.27954545
The output shows that our forest obtained a prediction accuracy of 1-0.1428=0.8572 and that the classification/prediction error was much higher for s (0.28) than for of (0.099), but of course we want all individual predictions and the evaluation of our usual confusion matrix:
d$PRED_PP_s <- predict(      # make d$PRED_PP_s the result of predicting
   rf_1,                     # from rf_1
   type="prob")[,"s"]        # predicted probabilities of "s"
d$PRED_CAT <- predict(rf_1)  # categorical predictions
d$PRED_PP_obs <- ifelse(     # d$PRED_PP_obs is determined by ifelse
   d$GENITIVE=="s",          # if the obs genitive is the 2nd level of the response
   d$PRED_PP_s,              # take its predicted probability
   1-d$PRED_PP_s)            # otherwise take 1 minus its predicted probability
d$CONTRIBS2LL <- -log(d$PRED_PP_obs)
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"=d$GENITIVE,    # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted genitives in the columns
PREDS
OBS of s
of 2452 268
s 246 634
c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.7028825 0.7204545 0.9088213 0.9014706
Acc. (overall)
0.8572222
Better than the trees but again not by that much. But what about R-squared? For that, we need to compute the deviance of the forest manually and then plug that into our usual equation for McFadden’s R2:
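A helper does this computation; here is a minimal sketch of the idea, using the per-case contributions to -2 log likelihood computed above. Note that OOB probabilities of exactly 0 would make the deviance infinite, so the sketch clips them at a small value – that clipping is my assumption, not necessarily what the helper does:

# a minimal sketch (not the helper actually used here):
contribs <- -log(pmax(d$PRED_PP_obs, 1e-6))  # clip 0-probabilities (assumption)
(dev_rf <- 2*sum(contribs))                  # the forest's deviance
dev_0 <- 4004.273                            # the null deviance from above
1 - dev_rf/dev_0                             # McFadden's R-squared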
Apparently, POR_ANIMACY is by far the most important variable, followed by LEN_PORmPUM_LOG, which nicely replicates the result from tree::tree.
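The variable importance scores underlying that statement are not echoed above; they can be retrieved and plotted with randomForest’s own functions:

importance(rf_1)  # permutation-based & Gini-based importance scores per predictor
varImpPlot(rf_1)  # dot chart of the same scores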
I’ve seen many papers/talks that leave it at this or do something really (sorry …) self-contradictory, namely discuss the nature/direction of the effects in a monofactorial (e.g., chi-squared based) way, but we’ll do better and look at partial dependence scores, which we’ll compute with pdp::partial; when we do so, I always set the which.class argument to the second level of the response because I’m so used to that from regression modeling (just like I did above for computing d$PRED_PP_s). Let’s look at scores for one categorical predictor (POR_ANIMACY) and one numeric predictor (LEN_PORmPUM_LOG):
(pd_pa <- partial(            # make pd_pa contain partial dependence scores
   object=rf_1,               # from this forest
   pred.var="POR_ANIMACY",    # for this predictor
   which.class=2,             # for the 2nd level of the response
   prob=TRUE,                 # return results on the probability scale
   train=d))                  # these were the training data
pd_ld <- partial(                  # make pd_ld contain partial dependence scores
   object=rf_1,                    # from this forest
   pred.var="LEN_PORmPUM_LOG",     # for this predictor
   which.class=2,                  # for the 2nd level of the response
   prob=TRUE,                      # return results on the probability scale
   train=d)                        # these were the training data
pd_ld[sort(sample(nrow(pd_ld), 5)),] # here's a sample of 5 rows
par(mfrow=c(1, 2))
plot(pd_pa); grid()
plot(pd_ld, type="l"); grid()
par(mfrow=c(1, 1)) # reset to default
Figure 5: Partial dependence plots for rf_1 (basic)
This is very, very similar to the effects plots we used for the regression results from before, and it’s interpretationally also very similar to the tree results. In my own work, I’d probably plot something like this, though – with bar widths reflecting the frequencies of the levels of POR_ANIMACY and point sizes reflecting the frequencies of the length differences:
Figure 6: Partial dependence plots for rf_1 (nicer)
No surprises there …
detach(package:randomForest) # housekeeping
2.2 Forests of conditional inference trees
Building a forest of conditional inference trees is only a bit different from the use of randomForest above:1
library(partykit)
# set a replicable random number seed
set.seed(sum(utf8ToInt("Gummibären"))); cf_1 <- cforest(
   GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER +
      MODALITY + POR_FINAL_SIB + POR_DEF,
   data=d,                      # the data are in d
   ntree=1500,                  # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE),  # ... sample w/ replacement
   mtry=3)                      # how many variables are eligible at each split
saveRDS(cf_1, file="_output/genitives_cf1.RDS")
As before, we can compute predictions and all the usual statistics:
d$PRED_PP_s <- predict(      # make d$PRED_PP_s the result of predicting
   cf_1,                     # from cf_1
   type="prob")[,"s"]        # predicted probabilities of "s"
d$PRED_CAT <- predict(cf_1)  # categorical predictions
# below I will show you a faster way to do this
d$PRED_PP_obs <- ifelse(     # d$PRED_PP_obs is determined by ifelse
   d$GENITIVE=="s",          # if the obs genitive is the 2nd level of the response
   d$PRED_PP_s,              # take its predicted probability
   1-d$PRED_PP_s)            # otherwise take 1 minus its predicted probability
d$CONTRIBS2LL <- -log(d$PRED_PP_obs)
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"=d$GENITIVE,    # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted genitives in the columns
PREDS
OBS of s
of 2484 236
s 214 666
c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.7383592 0.7568182 0.9206820 0.9132353
Acc. (overall)
0.8750000
Similarly, we can compute variable importance scores with varimp, which computes importance measures whose logic is comparable to the permutation-accuracy approach used in randomForest::importance above: a predictor’s association with the response gets destroyed (by permuting its values) to measure the impact this has on prediction accuracy. However, with an argument called conditional you can choose how the permutation is done:
conditional=FALSE refers to the default of randomly permuting the column of interest;
conditional=TRUE means that the variable of interest is permuted in a way that takes its correlation with other predictors into consideration and can avoid exaggerating the importance of predictors that are correlated with other predictors (which is also why computing conditional scores takes considerably longer than computing unconditional ones or might not even be computable at all):
sort(varimp(cf_1)) # sort the variable importance scores of cf_1
# sort(varimp(cf_1,    # sort the variable importance scores of cf_1, but
#    conditional=TRUE, # the conditional version
#    cores=10))        # how many threads to use, adjust as needed (set to 1 on Windoze)
Let’s use the pdp package already introduced above to again compute partial dependence scores for just two predictors:
(pd_pa <- partial(            # make pd_pa contain partial dependence scores
   object=cf_1,               # from this forest
   pred.var="POR_ANIMACY",    # for this predictor
   which.class=2,             # for the 2nd level of the response
   prob=TRUE,                 # return results on the probability scale
   train=d))                  # these were the training data
pd_ld <- partial(                  # make pd_ld contain partial dependence scores
   object=cf_1,                    # from this forest
   pred.var="LEN_PORmPUM_LOG",     # for this predictor
   which.class=2,                  # for the 2nd level of the response
   prob=TRUE,                      # return results on the probability scale
   train=d)                        # these were the training data
pd_ld[seq(1, 51, 10),] # here's a regularly-spaced sample of rows
And then we could plot those with either the standard plots …
par(mfrow=c(1, 2)) # make the plotting window have 1 row & 2 columns
plot(pd_pa); grid()
plot(pd_ld, type="l"); grid()
par(mfrow=c(1, 1)) # reset to default
Figure 7: Partial dependence plots for cf_1 (basic)
… or our nicer ones:
Figure 8: Partial dependence plots for cf_1 (nicer)
Here, too, the overall picture is similar to what we found before – just make sure you write it up well: We need variable importance scores (to know what’s doing how much in our data) and partial dependence scores (to know what is doing what in our data). Let’s now look at another non-linguistic application.
3 Women and children first?
We will use a non-linguistic example here simply because it’s such a great data set and question: When the Titanic sank in 1912, does the pattern of who survived and who didn’t suggest that people were following the women-and-children-first principle? The Titanic data are available as a data frame of the usual kind, which contains this information.
Load the data and use a tree-based approach to study the above question.
CASE CLASS SEX AGE SURVIVED
Min. : 1 1st :325 female: 470 adult:2092 no :1490
1st Qu.: 551 2nd :285 male :1731 child: 109 yes: 711
Median :1101 3rd :706
Mean :1101 crew:885
3rd Qu.:1651
Max. :2201
Ok, so this data set could become tricky for regression modeling, depending on how one proceeds: There’s a structural zero in the data (no children in the crew) and the number of children in general is very small, which could lead to regression trouble.
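A quick cross-tabulation (a check I am adding here, not shown in the original) makes that structural zero visible:

addmargins(table(d$CLASS, d$AGE))  # note the empty crew/child cell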
Let’s also compute the baseline that our analysis will try to beat and the response’s deviance:
(baseline <- max(        # make baseline the highest
   prop.table(           # proportion in the
   table(d$SURVIVED))))  # frequency table of the response variable
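The response’s deviance, which the previous sentence also announces, can be computed just as for the genitives; here’s a base-R sketch without the pipe:

-2*sum(log(rep(prop.table(table(d$SURVIVED)),  # each case's 'null' predicted probability
               table(d$SURVIVED))))            # repeated as often as each level occurs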
set.seed(sum(utf8ToInt("Kackvogel"))); cf_1 <-cforest(SURVIVED ~# a cond. inf. forest CLASS + SEX + AGE, # w/ these predictorsdata=d, # the variables are in xntree=750, # how many trees to fit/how many data sets to ...perturb=list(replace=TRUE), # ... sample w/ replacementmtry=2) # how many variables are eligible at each split
3.3 Evaluating the forest’s performance
How well does that forest do in terms of prediction accuracy, precision, recall, etc.?
d$PRED_PP_y <- predict(      # make d$PRED_PP_y the result of predicting
   cf_1,                     # from cf_1
   type="prob")[,"yes"]      # predicted probabilities of "yes"
d$PRED_CAT <- predict(cf_1)  # categorical predictions
(c_m <- table(
   OBS=d$SURVIVED,     # cross-tabulate observed survival outcomes in the rows
   PREDS=d$PRED_CAT))  # predicted survival outcomes in the columns
PREDS
OBS no yes
no 1470 20
yes 441 270
d$PRED_PP_obs <- ifelse(  # d$PRED_PP_obs is determined by ifelse
   d$SURVIVED=="yes",     # if the obs outcome is the 2nd level of the response
   d$PRED_PP_y,           # take its predicted probability
   1-d$PRED_PP_y)         # otherwise take 1 minus its predicted probability
c( # evaluate the confusion matrix
   "Pred. acc."   =mean(d$SURVIVED==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[     ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",     ]),
   "Prec. for no" =c_m[ "no", "no"] / sum(c_m[     , "no"]),
   "Rec. for no"  =c_m[ "no", "no"] / sum(c_m[ "no",     ]))
Pred. acc. Prec. for yes Rec. for yes Prec. for no Rec. for no
0.7905498 0.9310345 0.3797468 0.7692308 0.9865772
Which variables are most important for making these predictions?
sort(varimp(cf_1)) # sort the variable importance scores of cf_1
AGE CLASS SEX
0.08947584 0.15257832 0.30121848
sort(varimp(cf_1,      # sort the variable importance scores of cf_1, but
   conditional=TRUE,   # the conditional version
   cores=10))          # how many threads to use, adjust as needed (set to 1 on Windoze)
AGE CLASS SEX
0.08190308 0.14815233 0.22788570
Same order for both …
3.5 Partial dependence scores
And how are these variables related to the prediction of the second level of the response variable, SURVIVED: yes?
(pd_s <- partial(     # make pd_s contain partial dependence scores
   object=cf_1,       # from this forest
   pred.var="SEX",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=d))          # these were the training data
SEX yhat
1 female 0.744173
2 male 0.218873
(pd_a <- partial(     # make pd_a contain partial dependence scores
   object=cf_1,       # from this forest
   pred.var="AGE",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=d))          # these were the training data
AGE yhat
1 adult 0.3157287
2 child 0.4831514
(pd_c <- partial(     # make pd_c contain partial dependence scores
   object=cf_1,       # from this forest
   pred.var="CLASS",  # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=d))          # these were the training data
This is certainly looking like women-and-children first: We’re predicting SURVIVED: yes and
for SEX, the score for female is much much higher than that for male;
for AGE, the score for child is much higher than that for adult;
for CLASS, the scores go down as you get to lower booking classes, but crew members score between 1st and 2nd class.
I’m not sure this rises to a level of complexity that requires visualization, but SFLWR3 provides some plots for this. However, this might all be premature, as we will see now …
4 Excursus on importance scores and interactions
Trees and especially forests have become increasingly frequent in linguistic applications in part because this family of approaches is said to have several attractive/advantageous features. These include the perceptions (!) that
tree-based approaches are often seen as a good ‘plan B’, i.e. a methodological fall-back position when regression modeling is seen as too difficult or (near) impossible for a certain data set;
tree-based methods outperform regression modeling in many – most? – respects (and I’m hoping people mean that like I did above, namely as ‘outperform in terms of prediction’)
the visualization of at least trees is more intuitively interpretable than the summary/coefficients table usually reported for regression models;
trees are particularly good, or even better than regression modeling, at exploring/revealing interactions and at handling non-linearities;
fitting trees/forests seems ‘so easy’: Even for data that would be very hard to model with statistical-modeling tools such as regression methods and would, therefore, require much preparation, exploration, diagnostics, and (cross-)validation, machine-learning tools like trees seem to take one or two lines of code and then you have a random forest with a good prediction accuracy (without diagnostics, validation, etc.), and easy-to-obtain variable importance scores.
However, the situation is not quite as simple and clear-cut. First, the ease of interpretation of trees is often overstated, as can be seen when you compare the following two plots, which represent the result of a tree (left panel) and of a regression (right panel) addressing the question whether the fuel efficiency of a car (mpg) is correlated with its weight (wt) (in the mtcars data set):
Figure 9: A slope in a regression tree vs. in a regression
Also, note that the regression model makes more fine-grained predictions (and heterogeneity of predictions is in part a reflection of discriminatory/predictive power):
Figure 10: Predictions from a regression and the tree
Finally, with regard to ease of interpretability, some trees can be nightmarishly complex. You saw the conditional inference tree on the genitive data above that was extremely complex (and wasn’t even significantly better than a very small tree from tree::tree) and here is Figure 1 of Bernaisch, Gries, & Mukherjee (2014:17), which is also not exactly fun:
Figure 11: Figure 1 of Bernaisch, Gries, & Mukherjee (2014:17)
Another, much more important issue is concerned with the notion of interactions, in particular their identifiability and their relation to variable importance scores. Let’s first consider how an interaction would be represented in a tree. Consider a scenario (from SFLWR3: Section 5.2.4.2) where you have
a numeric response LENGTH;
two binary predictors:
CLAUSE: main vs. subord;
GRAMREL: obj vs. subj.
Since interactions are technically defined as “something doesn’t do the same everywhere/always” (SFLWR3: Section 5.2.4.2), a possible interaction might be that
the difference between object and subject lengths is positive when CLAUSE is main but
the difference between object and subject lengths is negative when CLAUSE is subord.
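To make this concrete, here is a tiny simulated data set with exactly that pattern (all numbers are invented purely for illustration):

# a tiny simulated example of such an interaction (invented numbers):
toy <- expand.grid(CLAUSE=c("main","subord"), GRAMREL=c("obj","subj"))
toy <- toy[rep(1:4, each=25),]  # 25 cases per cell
set.seed(42)                    # arbitrary seed, for replicability only
toy$LENGTH <- rnorm(100, sd=1,
   mean=ifelse(toy$CLAUSE=="main",
               ifelse(toy$GRAMREL=="obj", 8, 5),   # main: objects longer than subjects
               ifelse(toy$GRAMREL=="obj", 4, 7)))  # subord: objects shorter than subjects
tapply(toy$LENGTH, list(toy$CLAUSE, toy$GRAMREL), mean)  # the obj-subj difference flips sign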
In a conditional inference and a regression tree, this might look like this:
Figure 12: A 2-by-2 interaction in regression trees
Figure 13: A 2-by-2 interaction in regression trees
But there are cases where trees are not as great at interactions as one might think. As the depth/complexity of a tree grows, the interactions it can reveal become more and more intricate or specialized; it’s not uncommon to look at a more complex tree and see 4- or 5-way interactions, which are (i) extremely hard to describe and interpret and (ii) probably often not really explainable (let alone motivatable a priori). But sometimes it’s even worse: Sometimes, tree-based approaches don’t even see what they are supposedly so good at and/or provide importance results that are quite ambiguous and regularly misinterpreted.
Consider the following tiny data set of three binary predictors and one binary response (the s in the variable names stands for ‘small’):
The relation between the predictors and the response is such that P1s is most associated with the response Rs, P2s is less associated with Rs, and P3s is not associated with Rs at all:
table(P1s, Rs) # accuracy=0.7
Rs
P1s x y
a 7 3
b 3 7
table(P2s, Rs) # accuracy=0.6
Rs
P2s x y
e 6 4
f 4 6
table(P3s, Rs) # accuracy=0.5
Rs
P3s x y
m 6 6
n 4 4
However, there’s an interaction effect here such that P2s:P3s predicts Rs perfectly, i.e. with an accuracy of 1 (100%):
ftable(      # make a flat contingency table
   P2s, P3s, # these things in the rows
   Rs)       # this in the columns
Rs x y
P2s P3s
e m 6 0
n 0 4
f m 0 6
n 4 0
Why is this an interaction? Because, for instance, P3s: m doesn’t ‘do the same everywhere’:
it ‘promotes’ Rs: x when P2s is e, but
it ‘promotes’ Rs: y when P2s is f.
This data set exemplifies something that is at least relatable to the so-called XOR problem, “a situation where two variables show no main effect [not true of P2s here, though, which has a ‘main effect’] but a perfect interaction. In this case, because of the lack of a marginally detectable main effect, none of the variables may be selected in the first split of a classification tree, and the interaction may never be discovered” (Strobl et al. 2009:341). And indeed, tree::tree doesn’t see the perfectly predictive interaction – instead, it returns a tree with an accuracy of 0.75 that uses the interaction P1s:P3s. (Note that rpart::rpart doesn’t fare better, nor does party(kit)::ctree, which makes no split at all.)
plot(tree_tree <- tree::tree(Rs ~ P1s+P2s+P3s)); grid()  # plot the classification tree
axis(2); mtext("Deviance", 2, 3)      # add a useful y-axis
text(tree_tree, pretty=0, all=TRUE)   # add all full labels to it
text(            # plot text
   x=2.5, y=21,  # at these coordinates
   paste0("accuracy=",  # this string, followed by the accuracy
      mean(predict(tree_tree, type="class")==Rs)))
Figure 14: A tree on the (small) example data
But then, maybe this is because the data set is so small? Let’s increase it by one order of magnitude (the l in the variable names stands for ‘larger’) and try again:
P1l <- rep(P1s, 10)  # predictor 1 larger
P2l <- rep(P2s, 10)  # predictor 2 larger
P3l <- rep(P3s, 10)  # predictor 3 larger
Rl  <- rep(Rs, 10)   # response larger
Dl <- data.frame(P1l, P2l, P3l, Rl)
plot(tree_tree <- tree::tree(Rl ~ P1l+P2l+P3l)); grid()  # plot the classification tree
axis(2); mtext("Deviance", 2, 3)      # add a useful y-axis
text(tree_tree, pretty=0, all=TRUE)   # add all full labels to it
text(             # plot text
   x=5.5, y=125,  # at these coordinates
   paste0("accuracy=",  # this string, followed by the accuracy
      mean(predict(tree_tree, type="class")==Rl)))
Figure 15: A tree on the (large) example data
Now we get perfect accuracy (also from rpart::rpart or party(kit)::ctree), but the result is still not great in terms of parsimony: We’re not easily shown that P1l is not needed at all – we’re essentially presented with a 3-way interaction when a 2-way interaction is all that’s needed. The reason is that “the split selection process in regular classification trees is only locally optimal in each node: A variable and cutpoint are chosen with respect to the impurity reduction they can achieve in a given node defined by all previous splits, but regardless of all splits yet to come” (Strobl et al. 2009:333, my emphasis). In other words, in all cases so far, the algorithm sees that P1 has the highest monofactorial association and runs with it, but it never ‘backtracks’ to see whether starting with a split on the ‘monofactorially inferior’ P2 or P3 would ultimately lead to an overall superior outcome. But this is of course where random forests should be able to help, because, in some of the trees of the forest,
the random sampling of the data might have (i) weakened the association of P1l to Rl and (ii) strengthened the association of P2l or P3l to Rl, maybe giving these monofactorially weaker predictors ‘a leg up’ to be chosen on the first split;
P1l will not be available for the first split, so that P2l will be chosen instead, and if then P3l is available for the second split, the tree will get perfect accuracy (and such results will augment these variables’ importance scores).
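The forest rf_l whose partial dependence scores are shown below is not echoed in this section; here is a minimal sketch of how such a forest could be fit (the hyperparameter values are my assumptions, not necessarily the ones used for the scores below):

# a sketch of fitting rf_l (hyperparameters are assumptions):
library(randomForest)  # re-attach; it was detached above
set.seed(1)            # arbitrary seed
rf_l <- randomForest(Rl ~ P1l+P2l+P3l, data=Dl,
   ntree=1500,       # many trees for more stable importance scores
   mtry=2,           # 2 of the 3 predictors eligible at each split
   importance=TRUE)  # compute permutation importance scores
importance(rf_l)     # the importance scores discussed in the text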
# compute partial dependence scores
(pd_1 <- partial(     # make pd_1 contain partial dependence scores
   object=rf_l,       # from this forest
   pred.var="P1l",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=Dl))         # these were the training data
P1l yhat
1 a 0.2706
2 b 0.5476
(pd_2 <- partial(     # make pd_2 contain partial dependence scores
   object=rf_l,       # from this forest
   pred.var="P2l",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=Dl))         # these were the training data
P2l yhat
1 e 0.4109
2 f 0.4598
(pd_3 <- partial(     # make pd_3 contain partial dependence scores
   object=rf_l,       # from this forest
   pred.var="P3l",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=Dl))         # these were the training data
P3l yhat
1 m 0.4442
2 n 0.4904
In a sense, the variable importance scores and partial dependence scores are still not great: The fact that P2l:P3l is what really matters is still nowhere to be seen and even though the variable importance scores for P3l are really high, its partial dependence scores are really low. And all that counterintuitive mess in a data set as small and actually simple as this! Does a conditional inference forest perform better?
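As with rf_l, the fitting of cf_l is not echoed; a minimal sketch (the settings are my assumptions):

# a sketch of fitting cf_l (settings are assumptions):
set.seed(1)  # arbitrary seed
cf_l <- partykit::cforest(Rl ~ P1l+P2l+P3l, data=Dl,
   ntree=1500,                  # many trees for more stable scores
   perturb=list(replace=TRUE),  # bootstrap sampling w/ replacement
   mtry=2)                      # 2 of the 3 predictors eligible at each split
sort(partykit::varimp(cf_l))                    # unconditional importance scores
sort(partykit::varimp(cf_l, conditional=TRUE))  # conditional importance scores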
(pd_1 <- partial(     # make pd_1 contain partial dependence scores
   object=cf_l,       # from this forest
   pred.var="P1l",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=Dl))         # these were the training data
P1l yhat
1 a 0.2689748
2 b 0.5722766
(pd_2 <- partial(     # make pd_2 contain partial dependence scores
   object=cf_l,       # from this forest
   pred.var="P2l",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=Dl))         # these were the training data
P2l yhat
1 e 0.4705819
2 f 0.4342012
(pd_3 <- partial(     # make pd_3 contain partial dependence scores
   object=cf_l,       # from this forest
   pred.var="P3l",    # for this predictor
   which.class=2,     # for the 2nd level of the response
   prob=TRUE,         # return results on the probability scale
   train=Dl))         # these were the training data
P3l yhat
1 m 0.4544925
2 n 0.5021139
It does a bit better when the conditional variable importance scores are used because those are (nicely) small for P1l – but we still don’t get the one piece of information everyone would want, namely that P2l:P3l ‘does it all’. And we again have the seemingly counterintuitive result that the variable we know does nothing monofactorially – P3l – has low partial dependence scores (ok) but the by far highest variable importance score (?). So you need to understand the following (which is something I figure many users of forests actually do not know, judging from their discussions of forests – I dare you to find a discussion of forest results in linguistics that makes the following point clear and follows through on its implications): Especially if we’re used to regression modeling, what we’re getting shown from the variable importance score looks like a main effect, like ‘P3l (on its own) does a lot!’, but what the variable importance score really means is ‘P3l is (somehow!) important for making good predictions’, where somehow means P3l might
theoretically be a strong main effect (which, here, we know it is not from the simple cross-tabulation at the beginning);
be involved in one or more strong two-way interactions with P1l and/or P2l, i.e. the ‘real’ effect of P3l would be, or would follow from, what in regression modeling we’d write as P1l:P3l and/or P2l:P3l;
be involved in a strong three-way interaction with P1l and P2l, i.e. the ‘real’ effect of P3l would be in, or would follow from, P1l:P2l:P3l.
Thus, a high variable importance score for P3l can mean that any one (or more!) of the following things are important: P3l, P1l:P3l, P2l:P3l, or P1l:P2l:P3l, but you’re not told which of these it is. This fact about variable importance scores is something that most papers using random forests that I’ve seen seem not to have fully understood. Most papers I’ve seen arrive at a high variable importance score for some predictor like P3l and then interpret the effect of that ‘important’ variable in a monofactorial, main-effect kind of way. Many studies that used random forests will most likely have made a big monofactorial deal out of the predictors for which they obtained a high variable importance score without ever realizing that the real reason for the predictor’s high score was its participation in an interaction, and you shouldn’t trust the discussion if the authors don’t make it clear how they followed up on the variable importance score(s).
So how can we address this? As I discuss in Gries (2020) and in SFLWR3 (Section 7.3), we can address this by following
Forina et al. (2009), who “augment […] the prediction data with addition of complex combinations of the data” (Kuhn & Johnson 2013:48, also see p. 57f.);
Strobl et al. (2009:341), who state that “an interaction effect of two variables can also be explicitly added as a potential predictor variable”;
Baayen (2011:305f.), whose application of Naïve Discriminative Learning includes “pairs of features” as “the functional equivalent of interactions in a regression model”.
That means, we force the tree-based approach to consider the combined effect of P2l and P3l by explicitly including in the formula the combination/interaction of these two predictors or, more generally, of all predictors in whose interactions one might be interested. For this data set, it just means we create three new predictors:
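The code creating these interaction predictors is not echoed here; a sketch that mirrors the droplevels(x:y) approach used for the Titanic data below:

# a sketch of creating the three pairwise interaction predictors:
Dl$P1lxP2l <- droplevels(Dl$P1l:Dl$P2l)
Dl$P1lxP3l <- droplevels(Dl$P1l:Dl$P3l)
Dl$P2lxP3l <- droplevels(Dl$P2l:Dl$P3l)
# the tree(s) & forests are then refit with these predictors added to the formula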
partial(                 # show partial dependence scores
   object=rf_l,          # from this forest
   pred.var="P2lxP3l",   # for this predictor
   which.class=2,        # for the 2nd level of the response
   prob=TRUE,            # return results on the probability scale
   train=Dl)             # these were the training data
partial(                 # show partial dependence scores
   object=cf_l,          # from this forest
   pred.var="P2lxP3l",   # for this predictor
   which.class=2,        # for the 2nd level of the response
   prob=TRUE,            # return results on the probability scale
   train=Dl)             # these were the training data
Now (i) every tree/forest returns perfect accuracy, (ii) the forests return the by far highest variable importance scores for P2lxP3l, and (iii) the partial dependence scores of that predictor reveal the interaction pattern we knew is in the data … finally! And if you also add the three-way interaction P1lxP2lxP3l to the formula, then tree, ctree, and cforest all still recover that P2lxP3l is most important – only randomForest assigns a variable importance score to P1lxP2lxP3l that’s slightly higher than that of P2lxP3l.
Thus, bottom line, trees and especially the more robust and powerful forests can be good at detecting interactions, but it’s nowhere near as simple as is often thought. One should always (i) compute a forest, its accuracy, precision, and recall, but also (ii) variable importance scores and (iii) partial dependence scores. High variable importance scores for a predictor can mean many things: don’t take them at face value and don’t let them fool you into automatically assuming they reflect an important ‘main effect’ that you can simply plot/interpret and be done with – a high variable importance score only means ‘that predictor does something, somehow’, and you need to do follow-up analyses to find out what exactly that is. Studying that can be done with interaction predictors (or in other ways, see Gries 2020). In fact, Strobl et al. (2009:339) go so far as to advise: “In general, however, the complex random forest model, involving high-order interactions and nonlinearity, should be compared to a simpler model (e.g., a linear or logistic regression model including only low-order interactions) whenever possible to decide whether the simpler, interpretable model would be equally adequate”. Did you catch that?? Random forests are the more complex models, because each tree of a forest can easily involve 3-, 4-, 5-, … way interactions that people would never dream of fitting in a regression model but that they rely on unquestioningly in a forest. Don’t let the apparent simplicity of the R code for a forest trick you into believing that forests are the simpler method; part of why they do so much better at prediction than regressions is that they can involve very complex partitioning of the data.
Given this underappreciated (by many) complexity of random forests, Strobl et al. (2009:341) conclude that
random forests can at least serve as a benchmark predictor: If a linear or other parametric model with a limited number and degree of interaction terms can reach the (cross-validated or test sample) prediction accuracy of the more complex model, the extra complexity may be uncalled for and the simpler, interpretable model should be given preference. If, however, the prediction accuracy cannot be reached with the simpler model, and, for example, the high importance of a variable in a random forest is not reflected by its respective parameters in the simpler model, relevant nonlinear or interaction effects may be missing in the simpler model, and it may not be suited to grasping the complexity of the underlying process.
In other words, ‘consider doing both!’ – a regression and a tree-based method.
5 Women and children first? Part 2
Let’s revisit the Titanic then because, after having seen notable variable importance scores for SEX and AGE, the question becomes: what do these values for SEX and AGE reflect – main effects of, say, SEX (‘women first’!) and AGE (‘children first’!), or interactions that SEX and AGE participate in? Let’s find out and include all pairwise interactions. We first add those interactions to the data frame d:
# add interaction predictors for the two main predictors of interest
d$CLASSxSEX <- droplevels(d$CLASS:d$SEX)  # using droplevels to eliminate
d$CLASSxAGE <- droplevels(d$CLASS:d$AGE)  # factor level combinations that
d$SEXxAGE   <- droplevels(d$SEX:d$AGE)    # aren't attested
5.1 Fitting a conditional inference forest
So now we fit the next forest and include those:
set.seed(sum(utf8ToInt("Kackvogel"))); cf_2 <-cforest( SURVIVED ~ CLASS + SEX + AGE +# w/ these predictors CLASSxSEX + CLASSxAGE + SEXxAGE, # w/ these interaction predictorsdata=d, # the variables are in dntree=2500, # how many trees to fit/how many data sets to ...perturb=list(replace=TRUE), # ... sample w/ replacementmtry=3, # how many variables are eligible at each splitcores=12) # how many threads to use, adjust as needed (set to 1 on Windoze)
5.2 Evaluating the forest’s performance
So, how well is this forest doing in terms of prediction? There’s probably not going to be much of a change because there are really only very few combinations of variable levels, which the previous forest probably dealt with ok.
d$PRED_PP_y <- predict(      # make d$PRED_PP_y the result of predicting
   cf_2,                     # from cf_2
   type="prob")[,"yes"]      # predicted probabilities of "yes"
d$PRED_CAT <- predict(cf_2)  # categorical predictions
(c_m <- table(
   OBS=d$SURVIVED,     # cross-tabulate observed survival outcomes in the rows
   PREDS=d$PRED_CAT))  # predicted survival outcomes in the columns
PREDS
OBS no yes
no 1470 20
yes 441 270
c( # evaluate the confusion matrix
   "Pred. acc."   =mean(d$SURVIVED==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[     ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",     ]),
   "Prec. for no" =c_m[ "no", "no"] / sum(c_m[     , "no"]),
   "Rec. for no"  =c_m[ "no", "no"] / sum(c_m[ "no",     ]))
Pred. acc. Prec. for yes Rec. for yes Prec. for no Rec. for no
0.7905498 0.9310345 0.3797468 0.7692308 0.9865772
Indeed, the results are the same, but the main point of including the interaction predictors is not to boost prediction but to see whether our explanation is improved, so we move on to variable importance and then partial dependence:
5.3 Variable importance(s)
How do the variable importance scores change now that the interaction predictors are included?
sort(varimp(cf_2)) # sort the variable importance scores of cf_2
AGE CLASS CLASSxAGE SEX SEXxAGE CLASSxSEX
0.02506120 0.05680789 0.11606316 0.17746716 0.21202256 0.23031047
sort(varimp(cf_2,      # sort the variable importance scores of cf_2, but
   conditional=TRUE,   # the conditional version
   cores=12))          # how many threads to use, adjust as needed (set to 1 on Windoze)
SEX AGE CLASS CLASSxSEX CLASSxAGE SEXxAGE
0.001559753 0.001857089 0.003531200 0.009586468 0.015919016 0.023494299
The interactions come out on top.
5.4 Partial dependence scores
And now let’s see what the interactions are doing:
(pd_cs <- partial(         # make pd_cs contain partial dependence scores
   object=cf_2,            # from this forest
   pred.var="CLASSxSEX",   # for this predictor
   which.class=2,          # for the 2nd level of the response
   prob=TRUE,              # return results on the probability scale
   train=d))               # these were the training data
(pd_ca <- partial(         # make pd_ca contain partial dependence scores
   object=cf_2,            # from this forest
   pred.var="CLASSxAGE",   # for this predictor
   which.class=2,          # for the 2nd level of the response
   prob=TRUE,              # return results on the probability scale
   train=d))               # these were the training data
(pd_sa <- partial(         # make pd_sa contain partial dependence scores
   object=cf_2,            # from this forest
   pred.var="SEXxAGE",     # for this predictor
   which.class=2,          # for the 2nd level of the response
   prob=TRUE,              # return results on the probability scale
   train=d))               # these were the training data
For CLASS:SEX, there is obviously some sort of interaction effect: We knew from pd_s that the chances of women surviving are generally much higher than those of men, but we now see that this effect doesn’t hold everywhere in the same way. In other words, being female doesn’t help you everywhere/always equally (see 3rd class).
Here are the corresponding plots for the other two interactions:
Figure 17: The interaction of CLASS:AGE
Figure 18: The interaction of SEX:AGE
For CLASS:AGE, there is also some sort of interaction effect. We knew from pd_a that the chances of children surviving were generally much higher than those of adults, but we now see that this effect is restricted to first and second class and much less strong in 3rd class. Finally, SEX:AGE: We knew from pd_a that the chances of children surviving were generally much higher than those of adults, and we knew from pd_s that the chances of females surviving were much higher than those of males. If both these effects were perfectly additive, we’d see that female children have the highest scores for survival, but they don’t: female adults do, and among females the difference between children and adults is actually very minor. However, among males, the difference between children and adults is considerably bigger.
I hope this makes clear that, as discussed earlier, you do not want to take variable importance scores and partial dependence scores for predictors as indications of them being important just as main effects. Without the more fine-grained analysis here, our understanding of the effects would be much less complete; check the exercises for SFLWR3: Chapter 7, for a bit (including plots for the interaction) on this example.
6 Revisiting the subject realization data
Let us now review the subject realization data from the regression modeling part:
CASE SPKR SPEAKERID SUBJREAL CONTRAST
Min. : 1 njs:7371 26-JE_NNS: 821 no :5016 no :6393
1st Qu.: 3814 nns:7394 19-JC_NNS: 769 yes :1649 yes : 271
Median : 7634 18-JK_NJS: 760 NA's:8100 NA's:8101
Mean : 7630 7-JE_NJS : 731
3rd Qu.:11443 24-JE_NNS: 710
Max. :15265 1-JC_NJS : 705
(Other) :10269
GIVENNESS
Min. : 0.000
1st Qu.: 8.000
Median :10.000
Mean : 7.875
3rd Qu.:10.000
Max. :10.000
NA's :8990
d <-droplevels(d[complete.cases(d),])
Let’s fit a conditional inference forest …
set.seed(sum(utf8ToInt("Kackvogel"))); cf_1 <-cforest( SUBJREAL ~# this response variable ~ CONTRAST + GIVENNESS, # w/ these predictorsdata=d, # the variables are in dntree=2500, # how many trees to fit/how many data sets to ...perturb=list(replace=TRUE), # ... sample w/ replacementmtry=3, # how many variables are eligible at each splitcores=12) # how many threads to use, adjust as needed (set to 1 on Windoze)
… and let’s get the predictions:
# much faster way to compute the predictions:
qwe <- predict(cf_1, type="prob")  # all pred. probs. from cf_1
d$PRED_PP_y <- qwe[,"yes"]         # predicted probabilities of "yes"
d$PRED_CAT <- levels(d$SUBJREAL)[apply(qwe, 1, which.max)]  # categorical predictions
rm(qwe)
(c_m <- table(
   OBS=d$SUBJREAL,     # cross-tabulate obs. subj realizations in the rows
   PREDS=d$PRED_CAT))  # predicted subject realizations in the columns
PREDS
OBS no yes
no 3932 261
yes 636 946
c( # evaluate the confusion matrix
   "Pred. acc."   =mean(d$SUBJREAL==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[     ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",     ]),
   "Prec. for no" =c_m[ "no", "no"] / sum(c_m[     , "no"]),
   "Rec. for no"  =c_m[ "no", "no"] / sum(c_m[ "no",     ]))
Pred. acc. Prec. for yes Rec. for yes Prec. for no Rec. for no
0.8446753 0.7837614 0.5979772 0.8607706 0.9377534
Now, we have a pretty good understanding of this data set from the regression model. We know there’s a pretty strong interaction there – what do the variable importance scores say?
sort(varimp(cf_1)) # sort the variable importance scores of cf_1
CONTRAST GIVENNESS
0.06948541 0.24813726
sort(varimp(cf_1,       # sort the variable importance scores of cf_1, but
   conditional=TRUE,    # the conditional version
   cores=12))           # how many threads to use, adjust as needed (set to 1 on Windoze)
CONTRAST GIVENNESS
0.06979104 0.24859358
Obviously, this again does not address the issue of the interaction … But here's one way to deal with an interaction of a numeric and a categorical predictor, namely computing partial dependence scores for both predictors at the same time:
pd_cg <- partial(                        # make pd_cg contain partial dependence scores
   object=cf_1,                          # from this forest
   pred.var=c("GIVENNESS", "CONTRAST"),  # for these two predictors at the same time <-----------------
   which.class=2,                        # for the 2nd level of the response
   prob=TRUE,                            # return results on the probability scale
   train=d)                              # these were the training data
Let’s change the output format to something slightly easier to look at and also plot this to see what it means:
By the way, here’s an interesting little nugget that I’ve never seen or heard acknowledged in linguistics: This is what the documentation of cforest for both party (at least till version 1.3-5) and partykit (at least till version 1.2-10) says about itself (my emphasis): “Ensembles of conditional inference trees haven’t yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental” … Also note that the use of parallelization leads to random and not replicable results because, I’m suspecting, the random number seed is not being passed on to the different cores/threads correctly, but, while fixes to that are available, the package maintainer maintains this is beyond his control.↩︎
Source Code
---title: "Predictive modeling @ JLU Giessen: tree-based approaches"author: - name: "[Stefan Th. Gries](https://www.stgries.info)" affiliation: - UC Santa Barbara - JLU Giessen orcid: 0000-0002-6497-3958date: "2025-08-03 12:34:56"date-format: "DD MMM YYYY HH-mm-ss"editor: sourceformat: html: page-layout: full code-fold: false code-link: true code-copy: true code-tools: true code-line-numbers: true code-overflow: wrap number-sections: true smooth-scroll: true toc: true toc-depth: 4 number-depth: 4 toc-location: left monofont: lucida console tbl-cap-location: top fig-cap-location: bottom fig-width: 6 fig-height: 6 fig-format: png fig-dpi: 300 fig-align: center embed-resources: trueexecute: cache: false echo: true eval: true warning: false---# Trees## Classification and regression treesWe'll start our discussion of trees by revisiting an example that we discussed before in the context of glms: the choice of a genitive construction with the data in [_input/genitives.csv](_input/genitives.csv) and information about the variables/columns in [_input/genitives.r](_input/genitives.r). As before, our response variable is `GENITIVE`:```{r}rm(list=ls(all=TRUE)); invisible(gc())set.seed(sum(utf8ToInt("Gummibärchen")))library(pdp); library(tree)invisible(sapply(dir("_helpers", full.names=TRUE), source))summary(d <-read.delim( # summarize d, the result of loadingfile="_input/genitives.csv", # this filestringsAsFactors=TRUE)) # change categorical variables into factorsd$LEN_PORmPUM_LOG <-log2(d$POR_LENGTH)-log2(d$PUM_LENGTH)d <- d[,c(1:4,10,6:9,5)]```Let's re-compute the baseline for and deviance of our response variable; you should revisit the first fixef part to see why I could compute the deviance here as I do:```{r}#| results: hold(baseline <-max( # make baseline the highestprop.table( # proportion in thetable(d$GENITIVE)))) # frequency table of the responsed$GENITIVE %>% table %>% prop.table %>%rep(table(d$GENITIVE)) %>% log %>% sum %>%"*"(-2) # deviance# but you can also fit a 'null tree':(cart_0 <-tree(GENITIVE ~1, data=d)) %>% deviance```Let's begin with an extremely simple, monofactorial scenario, in which we'll look at the question of whether the genitive choices (`GENITIVE`) are correlated with animacy of the possessor (`POS_ANIMACY`). We first do a simple descriptive cross-tabulation and then use the function `tree::tree` (Ripley 2019) to create a first classification tree, which we also immediately summarize:```{r}addmargins(table(d$POR_ANIMACY, d$GENITIVE))summary(cart_1 <-tree( # summarize a tree called cart_1 GENITIVE ~# a classification tree of GENITIVE ~ POR_ANIMACY, # POR_ANIMACYdata=d)) # those variables are in d```Let's start at the bottom, where the results indicate that the tree makes 700 incorrect classifications and, thus, 3600-700=2900 correct ones, i.e. it has an accuracy of 0.8056. Next, the tree that the algorithm came up with has three terminal nodes -- what does it look like?```{r}#| label: fig-treemonofact#| fig-cap: The classification tree of cart_1#| fig-height: 3.5#| fig-width: 5plot(cart_1);grid() # plot the classification treeaxis(2); mtext("Deviance", 2, 3) # add a useful y-axistext(cart_1, # add labels to the treepretty=4, # abbreviate the labels to 4 charactersall=TRUE) # label all nodes & splits```The way to read this is quite simple: we start at the top, the so-called **root (node)** and see a first split and then right below it you see *of*. That *of* means 'if you don't do this split, you should predict *of* (because it's the more frequent level of `GENITIVE`)'. 
The expression above the split should be read as a question or a conditional expression: Is `POR_ANIMACY` *inanimate*? If 'yes', you go to the left and down that branch to the first terminal node, which 'predicts' *of*; if 'no', you go to the right and down that branch, which now features no inanimate possessors anymore, to the second split, which asks 'is `POR_ANIMACY` *collective*, *locative*, or *temporal*?. If yes, you got to the left and again 'predict' *of*, if no, you are left with only animate possessors and 'predict' *s*.This information is also represented in the object `cart_1` in tabular/text form:```{r}cart_1cart_1$frame # doesn't get displayed in render, must be a bug```Let's now extend this now to something more realistic (by using all predictors) and also look at `tree`'s output a bit more comprehensively:```{r}summary(cart_2 <-tree( # summarize an object called cart_2 GENITIVE ~# GENITIVE ~ all predictors LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF,data=d)) # those variables are in d``````{r residualmeandeviance}#| echo: false#| eval: falsedeviance(cart_2) / "-"( length(cart_2$where), # number of data points length(unique(cart_2$where))) # number of node```We get the same info as before, but now also a first (coarse) indication of variable importance, namely a list of the variables that actually feature in the tree. From the absence of several variables in that list, we can infer most variables are actually not relevant to the tree. Let's look at this in more detail:```{r}cart_2cart_2$frame # doesn't get displayed in render, must be a bug```Ok, so now looking at a plot might be more helpful ... maybe:```{r}#| label: fig-treemultifact#| fig-cap: The classification tree of cart_2#| fig-height: 5#| fig-width: 7plot(cart_2); grid() # plot the classification treeaxis(2); mtext("Deviance", 2, 3) # add a useful y-axistext(cart_2, # add labels to the treepretty=4, # abbreviate the labels to 4 charactersall=TRUE) # label all nodes & splits```How do we interpret this?* *of*-genitives are predicted when + node 2: possessors are inanimate (*p*~correct~=0.98); + node 13: possessors are {collective, locative, temporal} and `LEN_PORmPUM_LOG` is > -0.522544 (i.e. the possessor is mostly longer than the possessum and usually not at 3 units longer than the possessum) (*p*~correct~=0.83); + node 15: possessors are animate and `LEN_PORmPUM_LOG` is > 0.821928 (i.e. the possessor is at least 3 units longer than the possessum) (*p*~correct~=0.616);* *s*-genitives are predicted when + node 12: possessors are {collective, locative, temporal} and `LEN_PORmPUM_LOG` is < -0.522544 (i.e. the possessor is at least 2 units shorter than the possessum) (*p*~correct~=0.572); + node 14: possessors are animate and `LEN_PORmPUM_LOG` < 0.821928 (i.e. 
the possessor is mostly shorter than the possessum and usually not more than 4 units longer) (*p*~correct~=0.70227).```{r}#| echo: false#| eval: falsed_2 <-droplevels(d[d$POR_ANIMACY=="inanimate",]) # terminal: ofd_3 <-droplevels(d[d$POR_ANIMACY!="inanimate",]) d_3_06_12 <-droplevels(d_3[d_3$POR_ANIMACY!="animate"& d_3$LEN_PORmPUM_LOG <-0.522544,]) # terminal: s d_3_06_13 <-droplevels(d_3[d_3$POR_ANIMACY!="animate"& d_3$LEN_PORmPUM_LOG >-0.522544,]) # terminal: of d_3_07_14 <-droplevels(d_3[d_3$POR_ANIMACY=="animate"& d_3$LEN_PORmPUM_LOG <0.821928,]) # terminal: s d_3_07_15 <-droplevels(d_3[d_3$POR_ANIMACY=="animate"& d_3$LEN_PORmPUM_LOG >0.821928,]) # terminal: ofplot(pch=16, col="#00000020", d_3_06_12$POR_LENGTH, d_3_06_12$PUM_LENGTH); abline(0,1); grid()table(d_3_06_12$POR_LENGTH - d_3_06_12$PUM_LENGTH)```So, let's compute predictions and look at the tree's performance in a bit more detail than just the accuracy. TBH, I hardly ever see the *C*-score reported for trees and forests, but of course that's an option; similarly, I *never* see Cohen's *kappa* or an R-squared reported for trees or forests -- I don't understand why no one is doing that, given that esp. McFadden's *R*^2^ is very easy to compute and makes a lot of conceptual sense (both in general and because, obviously, deviance can play a big role for trees):```{r}d$PRED_PP_s <-predict( # make d$PRED_PP_s the result of predicting cart_2 # from cart_2 )[,"s"] # predicted probabilities of "s"# if you don't say type="response", predict gives you log oddsd$PRED_CAT <-factor( # make d$PRED_CAT the result of checkingifelse(d$PRED_PP_s>=0.5, # if the predicted prob. of "s" is >=0.5levels(d$GENITIVE)[2], # if yes, predict "s"levels(d$GENITIVE)[1]), # otherwise, predict "of"levels=levels(d$GENITIVE)) # make sure both levels are registered(c_m <-table( # confusion matrix: cross-tabulate"OBS"=d$GENITIVE, # observed genitives in the rows"PREDS"=d$PRED_CAT)) # predicted orders in the columnsc( # evaluate the confusion matrix"Prec. for s"=c_m[ "s", "s"] /sum(c_m[ , "s"]),"Acc./rec. for s"=c_m[ "s", "s"] /sum(c_m[ "s", ]),"Prec. for of"=c_m["of","of"] /sum(c_m[ ,"of"]),"Acc./rec. for of"=c_m["of","of"] /sum(c_m["of", ]),"Acc. (overall)"=mean(d$GENITIVE==d$PRED_CAT))c("McFadden's R-squared"=(deviance(cart_0)-deviance(cart_2)) /deviance(cart_0),"Cohen's kappa"=cohens.kappa(c_m)[[1]],"C"=C.score(d$GENITIVE, d$PRED_PP_s))```Pretty good actually!```{r}detach(package:tree) # housekeeping```## Conditional inference treesAn alternative tree-based approach that has gained some popularity in linguistic applications is that of conditional inference trees, in particular as implemented in the packages `party` ([Hothorn et al. 2006](https://www.tandfonline.com/doi/abs/10.1198/106186006X133933)) and `partykit` ([Hothorn et al. 2015](https://jmlr.org/papers/v16/hothorn15a.html)), which address concerns regarding the degree to which classification trees of the above kind can (i) overfit (which can be countered by pruning as alluded to above) and (ii) overestimate the importance of predictors with many different values (i.e. numeric predictors or categorical ones with many levels, see [Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/):342). This approach is conceptually similar to the kind of trees discussed above, but uses a *p*-value (adjusted for multiple tests) as a splitting criterion. 
Creating a conditional inference tree in R is (deceptively) simple, given everything we've discussed so far:```{r}#| label: fig-ctreemultfact1#| fig-cap: The conditional inference tree of ctree_1#| fig-width: 10library(partykit) # the newer reimplementation of the package partyctree_1 <-ctree( # create an object called ctree_1 GENITIVE ~# GENITIVE ~ all predictors LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF,data=d) # those variables are in dplot(ctree_1)```Actually, let's tweak the plot a bit to make it more legible:```{r}#| label: fig-ctreemultfact2#| fig-cap: The conditional inference tree of ctree_1#| fig-width: 10levels(ctree_1$data$POR_ANIMACY) <-abbreviate(levels(ctree_1$data$POR_ANIMACY), 3)levels(ctree_1$data$MODALITY) <-c("spok", "writ")levels(ctree_1$data$POR_DEF) <-c("def", "indef")levels(ctree_1$data$POR_FINAL_SIB) <-c("abs", "pres")plot(ctree_1)```Ever so slightly more complex ... If you look at this tree, though, why might its additional complexity not really be that great?* *s*-genitives are really rare on the right, where inanimates and locatives are -- *s*-genitives are only ever preferred with locatives and possessors that are notably shorter than possessums;* *s*-genitives are mostly preferred on the left and mostly with animate possessors unless the length difference prefers *of*-genitives more;* does this even have a significantly higher degree of accuracy or Cohen's *κ*?Let's now evaluate this more formally:```{r}d$PRED_PP_s <-predict( # make d$PRED_PP_s the result of predicting ctree_1, # from ctree_1type="prob")[,"s"] # predicted probabilities of "s"# if you don't say type="response", predict gives you log oddsd$PRED_CAT <-factor( # make d$PRED_CAT the result of checkingifelse(d$PRED_PP_s>0.5, # if the predicted prob. of "s" is >0.5 (note: >)levels(d$GENITIVE)[2], # if yes, predict "s"levels(d$GENITIVE)[1]), # otherwise, predict "of"levels=levels(d$GENITIVE)) # make sure both levels are registered# you also just use predict(ctree_1)(c_m <-table( # confusion matrix: cross-tabulate"OBS"=d$GENITIVE, # observed genitives in the rows"PREDS"=d$PRED_CAT)) # predicted orders in the columnsc( # evaluate the confusion matrix"Prec. for s"=c_m[ "s", "s"] /sum(c_m[ , "s"]),"Acc./rec. for s"=c_m[ "s", "s"] /sum(c_m[ "s", ]),"Prec. for of"=c_m["of","of"] /sum(c_m[ ,"of"]),"Acc./rec. for of"=c_m["of","of"] /sum(c_m["of", ]),"Acc. (overall)"=mean(d$GENITIVE==d$PRED_CAT))c("McFadden's R-squared"=(deviance(cart_0)-2334.671) /deviance(cart_0),"Cohen's kappa"=cohens.kappa(c_m)[[1]],"C"=C.score(d$GENITIVE, d$PRED_PP_s))```Is the accuracy of this tree (0.8486111) significantly better than that of the much simpler tree earlier (0.8375)? No (and the confidence interval of each Cohen's *κ* includes the other one):```{r}binom.test(3055, 3600, 3015/3600)$p.value # not really``````{r}#| echo: false#| eval: falseall_indiv_probs <-setNames(dbinom(0:3600, 3600, 3015/3600),0:3600)# greater than expected:gte <-sum(all_indiv_probs[as.character(3055:3600)])# less than expected: the highest number of successes who cumulative# sum of probs that does not exceed gte:lte <-sum(all_indiv_probs[cumsum(all_indiv_probs) < gte])gte + lte```However, one major problem of both kinds of trees is "their instability to small changes in the learning data [...] the entire tree structure could be altered if the first splitting variable [something to which we will return in detail below], or only the first cutpoint, was chosen differently because of a small change in the learning data. 
Because of this instability, the predictions of single trees show a high variability" ([Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/):330). Another problem of trees is that "the prediction of single trees is piecewise constant and thus may jump from one value to the next even for small changes of the predictor values" ([Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/):331). Both of these problems can be dealt with to some degree using so-called ensemble methods such as random forests, to which we turn now.```{r}detach(package:partykit) # housekeeping```# Ensembles of trees: random forestsRandom forests (Breiman 2001a, Hastie, Tibshirani, & Friedman 2014: Ch. 15, James et al. 2021: Section 8.2) are an instance of what are called **ensemble methods** because "an ensemble or committee of classification trees is aggregated for prediction" ([Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/):325). In other words, a random forest consists of `ntree` different trees, but with two additional layers of randomness, namely randomness* **on the level of the data points** because random forests involve running a tree on `ntree` different data sets that were bootstrapped -- sampled with replacement -- from the original data (see SFLWR^3^, Section 5.2.1.5) and then amalgamating all those `ntree` trees;* **on the level of predictors** because at every split in every tree, only a randomly-selected subset of the predictors is eligible to be chosen, namely `mtry` predictors.Random forests usually overfit much less than individual trees or even training/test validation approaches and can be straightforwardly evaluated given that random forest accuracies are prediction, not classification, accuracies. Why/how is that? If you bootstrap with replacement from your data, then, like I mentioned above, each of the `ntree` trees will* be based only on the cases that were sampled (with replacement) this time around, i.e. their training set, which, on average, will be 63.2% of all your different cases);* not 'have seen' the, on average, remaining 100-63.2=36.8% of the cases that were not sampled for the current tree, i.e. cases that the current tree was not trained on, which are referred to as the **held-out** cases or as the **OOB (out of bag)** cases.Random forests are often excellent at predictions. But there is also at least one disadvantage of random forests: While they excel at prediction (often outperforming regression models and individual trees in that sense), interpreting them can be hard: There is no single model, no single tree from a single data set that can be straightforwardly plotted -- there's a forest of, say, 500 different trees fit on 500 differently-sampled data sets, with thousands of different splits in different locations in trees with differently-sampled predictors available for splits. And because of precisely that I hope to convince you that the practice of trying to appealingly interpret/summarize such a forest of `ntree` trees with a single tree fitted to the whole data set is more than problematic; see James et al. (2021:343) for a comment regarding trees/forests to that effect:"Recall that one of the advantages of decision trees is the attractive and easily interpreted diagram that results, such as the one displayed in Figure 8.1. 
However, when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure using a single tree [...]"With regard to the **size or importance of an effect/predictor**, the random forest implementations that seem to be most widely used in linguistics -- `randomForest::randomForest` and `party::cforest`/`partykit::cforest` -- offer the functionality of computing so-called **variable importance scores**, some version of which is regularly reported. Variable importance scores basically answer the question "how much (if at all) does a certain predictor contribute to predicting the response?".*****Warning**: Note the exact formulation I'm using here: There's an important reason why I did *not* simply say "how much does a certain predictor predict the response?" or "how much is a predictor correlated with the response?".***With regard to the **nature/direction of an effect**, on the other hand, one can report **partial dependence scores**, which answer the question "how do values/levels of a certain predictor help predict the response?", but for no good reason I see those much less often. Let's now discuss how this all works; thankfully, in terms of R code, it's all pretty (but again deceptively) straightforward.## Forests of classification and regression treesLet's revisit the genitive data using the maybe best-known implementation in `randomForest::randomForest` ([Liaw & Wiener 2002](https://www.r-project.org/doc/Rnews/Rnews_2002-3.pdf)). The following shows that we use the familiar formula notation, but we also tweak a few hyperparameters and arguments (the book discusses more):* `ntree`: how many trees we want the forest to have; the default is 500 but since the data set is not that big, we can straightforwardly increase that number because "[i]n general, the higher the number of trees, the more reliable the prediction and the interpretability of the variable importance" ([Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/):343);* `mtry`: how many predictors we want to be available at each split; the default for a forest of + classification trees is the square root of the number of predictors in the formula + regression trees is ^1^/~3~ of the number of predictors; + note that "[w]hen predictor variables are highly correlated, [...] a higher number of randomly preselected predictor variables is better suited to reflect conditional importance" ([Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/):343);* `importance`: whether we want predictor importance to be computed (the default is `FALSE`).```{r genitiveRF}library(randomForest)# set a replicable random number seedset.seed(sum(utf8ToInt("Bacon bits"))); (rf_1 <- randomForest(GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF, data=d, # the data are in d ntree=1500, # how many trees to fit/how many data sets to ... 
mtry=3, # how many variables are eligible at each split importance=TRUE)) # compute importance scores```The output shows that our forest obtained a prediction accuracy of 1-0.1428=0.8572 and that the classification/prediction error was much higher for *s* (0.28) than for *of* (0.099), but of course we want all individual predictions and the evaluation of our usual confusion matrix:```{r genitiveRFPRED}d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting rf_1, # from rf_1 type="prob")[,"s"] # predicted probabilities of "s"d$PRED_CAT <- predict(rf_1) # categorical predictionsd$PRED_PP_obs <- ifelse( # d$PRED_PP_obs is determined by ifelse d$GENITIVE=="s", # if the obs genitive is the 2nd level of the response d$PRED_PP_s, # take its predicted probability 1-d$PRED_PP_s) # otherwise take 1 minus its predicted probabilityd$CONTRIBS2LL <- -log(d$PRED_PP_obs)(c_m <- table( # confusion matrix: cross-tabulate "OBS" =d$GENITIVE, # observed genitives in the rows "PREDS"=d$PRED_CAT)) # predicted orders in the columnsc( # evaluate the confusion matrix "Prec. for s" =c_m[ "s", "s"] / sum(c_m[ , "s"]), "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s", ]), "Prec. for of" =c_m["of","of"] / sum(c_m[ ,"of"]), "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of", ]), "Acc. (overall)" =mean(d$GENITIVE==d$PRED_CAT))c("Cohen's kappa"=cohens.kappa(c_m)[[1]], "C"=C.score(d$GENITIVE, d$PRED_PP_s))```Better than the trees but again not by that much. But what about R-squared? For that, we need to compute the deviance of the forest manually and the plug that into our usual equation for McFadden's *R*^2^:```{r}"-"(deviance(cart_0),sum(-log(d$PRED_PP_obs)) *2) /deviance(cart_0)```?! Why doesn't this work?```{r}summary(d$PRED_PP_obs)```How do we address this? By smoothing or clipping; here's a very didactically detailed way to do this:```{r}offset <-min( d$PRED_PP_obs %>% unique %>% sort %>%"["(2), d$PRED_PP_obs %>% unique %>%sort(TRUE) %>%"["(2) %>%"-"(1, .)) /2d$PRED_PP_obs[d$PRED_PP_obs==0] <- offsetd$PRED_PP_obs[d$PRED_PP_obs==1] <-1-offsetsummary(d$PRED_PP_obs)```Now it works:```{r}"-"(deviance(cart_0),sum(-log(d$PRED_PP_obs)) *2) /deviance(cart_0)```But which variables are driving this prediction accuracy most? We can get those from `rf_1` with the function `importance`:```{r genitiveRFVARIMP}varimps <- rf_1$importance[,3:4] # make varimps the variable importances of rf_1varimps[order(varimps[,1]),]```Apparently, `POR_ANIMACY` is by far the most important variable, followed by `LEN_PORmPUM_LOG`, which nicely replicates the result from `tree::tree`.I've seen many papers/talks that leave it at this or do something really (sorry ...) self-contradictory, namely discuss the nature/direction of the effects in a monofactorial (e.g., chi-squared based) way, but we'll do better and look at partial dependence scores, which we'll compute with `pdp::partial`; when we do so, I always set the `which.class` argument to the second level of the response because I'm so used to that from regression modeling (just like I did above for computing `d$PRED_PP_s`). 
Let's look at scores for one categorical predictor (`POR_ANIMACY`) and one numeric predictor (`LEN_PORmPUM_LOG`):```{r genitiveRFPDP}(pd_pa <- partial( # make pd_pa contain partial dependence scores object=rf_1, # from this forest pred.var="POR_ANIMACY", # for this predictor which.class=2, # for the 2nd level of the response prob=TRUE, # return results on the probability scale train=d)) # these were the training datapd_ld <- partial( # make pd_ld contain partial dependence scores object=rf_1, # from this forest pred.var="LEN_PORmPUM_LOG", # for this predictor which.class=2, # for the 2nd level of the response prob=TRUE, # return results on the probability scale train=d) # these were the training datapd_ld[sort(sample(nrow(pd_ld), 5)),] # here's a sample of 5 rows```And then we could plot those:```{r}#| label: fig-pdprf1_basic#| fig-cap: Partial dependence plots for rf_1 (basic)#| fig-width: 10#| fig-show: holdpar(mfrow=c(1, 2))plot(pd_pa); grid()plot(pd_ld, type="l"); grid()par(mfrow=c(1, 1)) # reset to default```This is very very similar to the effects plots we used for the regression results from before, and it's interpretationally also very similar to the tree results. In my own work, I'd probably plot something like this, though -- with bar widths reflecting the frequencies of the levels of `POR_ANIMACY` and point sizes reflecting the frequencies of the differences clause lengths:```{r}#| echo: false#| label: fig-pdprf1_better#| fig-cap: Partial dependence plots for rf_1 (nicer)#| fig-width: 10#| fig-show: holdpar(mfrow=c(1, 2)) # make the plotting window have 1 rows & 2 columnsbarplot( # make a bar plotmain="Partial dep. of GENITIVE on POR_ANIMACY", # w/ this headingcol="darkgrey", # grey barsheight=pd_pa$yhat, # whose heights are the PDP scoresxlab="Animacy level", # x-axis labelylab="Partial dependence (pred.prob.) (for 's')", # y-axis labelylim=c(0, 1), # y-axis limits# look up ?abbreviate, which I use to make the names fit:names.arg=abbreviate(pd_pa$POR_ANIMACY, 4), # label the bars like this# make the widths of the bars represent the proportions of CONJwidth=prop.table(table(d$POR_ANIMACY))); grid() # add a gridplot( # make a plotmain="Partial dep. of GENITIVE on length differences", # w/ this headingtype="b", pch=16, # lines & points (filled circles)xlab="Difference of logged lengths", # x-axis labelylab="Partial dep. score (for sc-mc)", # y-axis labelylim=c(0, 1), # y-axis limitsx=pd_ld$LEN_PORmPUM_LOG, # x-coords: the lengthsy=pd_ld$yhat, # y-coords: PDP scorescex=1+prop.table(table(cut(d$LEN_PORmPUM_LOG, 40)))*15) # point sizes = freqs of the lengthsabline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0, 1, 0.1)), col="grey", lty=3) # x-axis decilespar(mfrow=c(1, 1)) # reset to default```No surprises there ...```{r}detach(package:randomForest) # housekeeping```## Forests of conditional inference treesBuilding a forests of conditional inference trees is only a bit different from the use of `randomForest` above:^[By the way, here's an interesting little nugget that I've never seen or heard acknowledged in linguistics: This is what the documentation of `cforest` for both `party` (at least till version 1.3-5) and `partykit` (at least till version 1.2-10) says about itself (my emphasis): "Ensembles of conditional inference trees **haven't yet been extensively tested**, so this routine is meant **for the expert user only** and **its current state is rather experimental**" ... 
Also note that the use of parallelization leads to random and *not* replicable results because, I’m suspecting, the random number seed is not being passed on to the different cores/threads correctly, but, while fixes to that are available, the package maintainer maintains this is beyond his control.]```{r genitiveCIF}library(partykit)# set a replicable random number seedset.seed(sum(utf8ToInt("Gummibären"))); cf_1 <- cforest(GENITIVE ~ LEN_PORmPUM_LOG + POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF, data=d, # the data are in d ntree=1500, # how many trees to fit/how many data sets to ... perturb=list(replace=TRUE), # ... sample w/ replacement mtry=3) # how many variables are eligible at each splitsaveRDS(cf_1, file="_output/genitives_cf1.RDS")```As before, we can compute predictions and all the usual statistics:```{r genitiveCIFPRED}d$PRED_PP_s <- predict( # make d$PRED_PP_s the result of predicting cf_1, # from cf_1 type="prob")[,"s"] # predicted probabilities of "s"d$PRED_CAT <- predict(cf_1) # categorical predictions# below I will show you a faster way to do thisd$PRED_PP_obs <- ifelse( # d$PRED_PP_obs is determined by ifelse d$GENITIVE=="s", # if the obs genitive is the 2nd level of the response d$PRED_PP_s, # take its predicted probability 1-d$PRED_PP_s) # otherwise take 1 minus its predicted probabilityd$CONTRIBS2LL <- -log(d$PRED_PP_obs)(c_m <- table( # confusion matrix: cross-tabulate "OBS" =d$GENITIVE, # observed genitives in the rows "PREDS"=d$PRED_CAT)) # predicted orders in the columnsc( # evaluate the confusion matrix "Prec. for s" =c_m[ "s", "s"] / sum(c_m[ , "s"]), "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s", ]), "Prec. for of" =c_m["of","of"] / sum(c_m[ ,"of"]), "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of", ]), "Acc. (overall)" =mean(d$GENITIVE==d$PRED_CAT))c("Cohen's kappa"=cohens.kappa(c_m)[[1]], "C"=C.score(d$GENITIVE, d$PRED_PP_s))```Similarly, we can compute variable importance scores with `varimp`, which computes importance measures whose logic is comparable to the permutation-accuracy one used in `randomForest::importance` above, where predictors' associations with the response gets destroyed to measure the impact this has on prediction accuracy. 
However, with an argument called `conditional` you can choose how the permutation is done:* `conditional=FALSE` refers to the default of randomly permuting the column of interest;* `conditional=TRUE` means that the variable of interest is permuted in a way that takes its correlation with other predictors into consideration and can avoid exaggerating the importance of predictors that are correlated with other predictors (which is also why computing conditional scores takes *considerably* longer than computing unconditional ones or might not even be computable at all):```{r genitiveCIFVARIMP}sort(varimp(cf_1)) # sort the variable importance scores of cf_1# sort(varimp(cf_1, # sort the variable importance scores of cf_1, but# conditional=TRUE, # the conditional version# cores=10)) # how many threads to use, adjust as needed (set to 1 on Windoze)```Let's use the `pdp` package already introduced above to again compute partial dependence scores for just two predictors:```{r genitiveCIFPDP}(pd_pa <- partial( # make pd_pa contain partial dependence scores object=cf_1, # from this forest pred.var="POR_ANIMACY", # for this predictor which.class=2, # for the 2nd level of the response prob=TRUE, # return results on the probability scale train=d)) # these were the training datapd_ld <- partial( # make pd_ld contain partial dependence scores object=cf_1, # from this forest pred.var="LEN_PORmPUM_LOG", # for this predictor which.class=2, # for the 2nd level of the response prob=TRUE, # return results on the probability scale train=d) # these were the training datapd_ld[seq(1, 51, 10),] # here's a sample of 5 rows```And then we could plot those with either the standard plots ...```{r}#| label: fig-pdpcf1_basic#| fig-cap: Partial dependence plots for cf_1 (basic)#| fig-width: 10#| fig-show: holdpar(mfrow=c(1, 2)) # make the plotting window have 1 rows & 2 columnsplot(pd_pa); grid()plot(pd_ld, type="l"); grid()par(mfrow=c(1, 1)) # reset to default```... or our nicer ones:```{r}#| echo: false#| label: fig-pdpcf1_better#| fig-cap: Partial dependence plots for cf_1 (nicer)#| fig-width: 10#| fig-show: holdpar(mfrow=c(1, 2)) # make the plotting window have 1 rows & 2 columnsbarplot( # make a bar plotmain="Partial dep. of GENITIVE on POR_ANIMACY", # w/ this headingcol="darkgrey", # grey barsheight=pd_pa$yhat, # whose heights are the PDP scoresxlab="Animacy level", # x-axis labelylab="Partial dependence (pred.prob.) (for 's')", # y-axis labelylim=c(0, 1), # y-axis limits# look up ?abbreviate, which I use to make the names fit:names.arg=abbreviate(pd_pa$POR_ANIMACY, 4), # label the bars like this# make the widths of the bars represent the proportions of CONJwidth=prop.table(table(d$POR_ANIMACY))); grid() # add a gridplot( # make a plotmain="Partial dep. of GENITIVE on length differences", # w/ this headingtype="b", pch=16, # lines & points (filled circles)xlab="Difference of logged lengths", # x-axis labelylab="Partial dep. 
score (for sc-mc)", # y-axis labelylim=c(0, 1), # y-axis limitsx=pd_ld$LEN_PORmPUM_LOG, # x-coords: the lengthsy=pd_ld$yhat, # y-coords: PDP scorescex=1+prop.table(table(d$LEN_PORmPUM_LOG))*20) # point sizes = freqs of the lengths# cex=1+prop.table(table(cut(d$LEN_PORmPUM_LOG, 40)))*15) # point sizes = freqs of the lengthsabline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0, 1, 0.1)), col="grey", lty=3) # x-axis decilespar(mfrow=c(1, 1)) # reset to default```Here, too, the overall picture is similar to what we found before -- just make sure you write it up well: We need variable importance scores (to know what's doing how much in our data) *and* partial dependence scores (to know what is doing what in our data). Let's now look at another non-linguistic application.# Women and children first?We will use a non-linguistic example here simply because it's such a great data set and question: When the Titanic sank in 1912, does the pattern of who survived and who didn't suggest that people were following the [women-and-children-first principle](https://en.wikipedia.org/wiki/Women_and_children_first)? The [Titanic data](_input/titanic.csv) are available as a data frame of the usual kind, which contains [this information](_input/titanic.r).Load the data and use a tree-based approach to study the above question.```{r}rm(list=ls(all.names=TRUE)); invisible(gc())summary(d <-read.delim("_input/titanic.csv", stringsAsFactors=TRUE))invisible(sapply(dir("_helpers", full.names=TRUE), source))```## Exploration and preparationLet's do a very quick exploration just to get a feel for what's going on:```{r}ftable(d$CLASS, d$SEX, d$AGE , d$SURVIVED) # perspective 1ftable(d$SEX , d$AGE, d$CLASS, d$SURVIVED) # perspective 2```Ok, so this data set could become tricky for regression modeling, depending on how one proceeds: There's a structural zero in the data (no children in the crew) and the number of children in general is very small, which could lead to regression trouble.Let's also compute the baseline that our analysis will try to beat and the response's deviance:```{r}(baseline <-max( # make baseline.1 the highestprop.table( # proportion in thetable(d$SURVIVED)))) # frequency table of the response variabledeviance(glm(SURVIVED ~1, family=binomial, data=d))```## Fitting a conditional inference forestOk, let's fit the forest:```{r titanicCIF}set.seed(sum(utf8ToInt("Kackvogel"))); cf_1 <- cforest(SURVIVED ~ # a cond. inf. forest CLASS + SEX + AGE, # w/ these predictors data=d, # the variables are in x ntree=750, # how many trees to fit/how many data sets to ... perturb=list(replace=TRUE), # ... sample w/ replacement mtry=2) # how many variables are eligible at each split```## Evaluating the forest's performanceHow well does that forest do in terms of prediction accuracy, precision, recall, etc.?```{r titanicPRED}d$PRED_PP_y <- predict( # make d$PRED_PP_y the result of predicting cf_1, # from cf_1 type="prob")[,"yes"] # predicted probabilities of "yes"d$PRED_CAT <- predict(cf_1) # categorical predictions(c_m <- table(OBS=d$SURVIVED, # cross-tabulate observed survival outcomes in the rows PREDS=d$PRED_CAT)) # predicted survival outcomes in the columnsd$PRED_PP_obs <- ifelse( # d$PREDS.NUM.t.obs is determined by ifelse d$SURVIVED=="yes", # if the obs outcome is the 2nd level of the response d$PRED_PP_y, # take its predicted probability 1-d$PRED_PP_y) # otherwise take 1 minus its predicted probabilityc( # evaluate the confusion matrix "Pred. acc." =mean(d$SURVIVED==d$PRED_CAT), "Prec. 
for yes"=c_m["yes","yes"] / sum(c_m[ ,"yes"]), "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",]), "Prec. for no" =c_m[ "no","no"] / sum(c_m[ , "no"]), "Rec. for no" =c_m[ "no","no"] / sum(c_m[ "no",]))c("Cohen's kappa"=cohens.kappa(c_m)[[1]], "C"=C.score(d$SURVIVED, d$PRED_PP_y))```## Variable importance(s)Which variables are most important for making these predictions?```{r titanicVARIMP}sort(varimp(cf_1)) # sort the variable importance scores of cf_1sort(varimp(cf_1, # sort the variable importance scores of cf_1, but conditional=TRUE, # the conditional version cores=10)) # how many threads to use, adjust as needed (set to 1 on Windoze)```Same order for both ...## Partial dependence scoresAnd how are these variables related to the prediction of the second level of the response variable, `SURVIVED`: *yes*?```{r titanicPDP}(pd_s <- partial( # make pd.c contain partial dependence scores object=cf_1, # from this forest pred.var="SEX", # for this predictor which.class=2, # for the 2nd level of the response prob=TRUE, # return results on the probability scale train=d)) # these were the training data(pd_a <- partial( # make pd.c contain partial dependence scores object=cf_1, # from this forest pred.var="AGE", # for this predictor which.class=2, # for the 2nd level of the response prob=TRUE, # return results on the probability scale train=d)) # these were the training data(pd_c <- partial( # make pd.c contain partial dependence scores object=cf_1, # from this forest pred.var="CLASS", # for this predictor which.class=2, # for the 2nd level of the response prob=TRUE, # return results on the probability scale train=d)) # these were the training data```This is certainly looking like women-and-children first: We're predicting `SURVIVED`: *yes* and* for `SEX`, the score for *female* is much much higher than that for *male*;* for `AGE`, the score for *child* is much higher than that for *adult*;* for `CLASS`, the scores go down as you get to lower booking classes, but crew members score between 1st and 2nd class.I'm not sure this rises to a level of complexity that requires visualization, but SFLWR^3^ provides some plots for this. However, this might all be premature, as we will see now ...# Excursus on importance scores and interactionsTrees and especially forests have become increasingly frequent in linguistic applications in part because this family of approaches is said to have several attractive/advantageous features. These include the perceptions (!) that* tree-based approaches are often seen as a **good 'plan B'**, i.e. a methodological fall-back position when regression modeling is seen as too difficult or (near) impossible for a certain data set;* tree-based methods **outperform** regression modeling in many -- most? 
-- respects (and I'm hoping people mean that like I did above, namely as 'outperform in terms of prediction')* the visualization of at least trees is **more intuitively interpretable** than the summary/coefficients table usually reported for regression models;* trees are particularly good, or even better than regression modeling, at **exploring/revealing interactions** and at handling non-linearities;* fitting trees/forests seems **'so easy'**: Even for data that would be very hard to model with statistical-modeling tools such as regression methods and would, therefore, require much preparation, exploration, diagnostics, and (cross-)validation, machine-learning tools like trees seem to take one or two lines of code and then you have a random forest with a good prediction accuracy (without diagnostics, validation, etc.), and easy-to-obtain variable importance scores.However, the situation is not quite as simple and clear-cut. First, the ease of interpretation of trees is often overstated, as can be seen when you compare the following two plots, which represent the result of a tree (left panel) and of a regression (right panel) addressing the question whether the fuel efficiency of a car (`mpg`) is correlated with its weight (`wt`) (in the mtcars data set):```{r}#| echo: false#| label: fig-slope_treeVregr#| fig-cap: A slope in a regression tree vs. in a regression#| fig-width: 10#| fig-show: holdpar(mfrow=c(1,2))plot(qwe <- tree::tree(mpg ~ wt, data=mtcars)); grid()axis(2); mtext("Deviance", 2, 3)text(qwe, pretty=0, all=TRUE)plot(pch=16,xlab="Weight", x=mtcars$wt,ylab="MPG" , y=mtcars$mpg); grid() m <-lm(mtcars$mpg ~poly(mtcars$wt, 2))lines(mtcars$wt[order(mtcars$wt)], predict(m)[order(mtcars$wt)])par(mfrow=c(1,1))```Also, note that the regression model makes more fine-grained predictions (and heterogeneity of predictions is in part a reflection of discriminatory/predictive power):```{r}#| echo: false#| label: fig-pred_treeVregr#| fig-cap: Predictions from a regression and the treeplot(pch=16,xlab="Weight", x=mtcars$wt,ylab="Observed/black & predicted/grey MPG", y=mtcars$mpg); grid() m <-lm(mtcars$mpg ~poly(mtcars$wt, 2))lines(mtcars$wt[order(mtcars$wt)], predict(m)[order(mtcars$wt)])abline(v=c(2.26, 3.325, 3.49), lty=2)text(c(2.26, 3.325), c(15, 30), c(2.26, 3.325), srt=90, adj=c(0.5, -0.25))text(3.49, 30, 3.49, srt=90, adj=c(0.5, 1.25))predictions <-as.numeric(names(rev(table(predict(qwe)))))segments(0, predictions[1], 2.26, predictions[1], col="#00000030", lwd=8)segments(2.26, predictions[2], 3.325, predictions[2], col="#00000030", lwd=8)segments(3.325, predictions[3], 3.49, predictions[3], col="#00000030", lwd=8)segments(3.49, predictions[4], 6, predictions[4], col="#00000030", lwd=8)```Finally, with regard to ease of interpretability, some trees can be nightmarishly complex. You saw the conditional inference tree on the genitive data above that was extremely complex (and wasn't even significantly better than a very small tree from `tree::tree`) and here is Figure 1 of Bernaisch, Gries, & Mukherjee (2014:17), which is also not exactly fun:{#fig-treeBGM}Another, much more important issue is concerned with the notion of interactions, in particular their identifiability and their relation to variable importance scores. Let's first consider how an interaction would be represented in a tree. Consider a scenario (from SFLWR^3^: Section 5.2.4.2) where you have* a numeric response `LENGTH`;* two binary predictors: + `CLAUSE`: *main* vs. *subord*; + `GRAMREL`: *obj* vs. 
*subj*.Since interactions are technically defined as "something doesn’t do the same everywhere/always" (SFLWR^3^: Section 5.2.4.2), a possible interaction might be that* the difference between object and subject lengths is positive when `CLAUSE` is *main* but* the difference between object and subject lengths is negative when `CLAUSE` is *subord*.In a conditional inference and a regression tree, this might look like this:```{r}#| echo: false#| label: fig-inactINtree#| fig-cap: A 2-by-2 interaction in regression trees#| fig-width: 10set.seed(1)s2 <-expand.grid(CASEPERCONDITION=1:30, CLAUSE=c("main", "subord"), GRAMREL=c("obj", "subj"))s2$GRAMREL <-factor(s2$GRAMREL, levels=levels(s2$GRAMREL)[2:1])s2$LENGTH <-c(rnorm(n=30, mean=5.0, sd=0.1), rnorm(n=30, mean=4.0, sd=0.1),rnorm(n=30, mean=2.0, sd=0.1), rnorm(n=30, mean=6.0, sd=0.1))plot(partykit::ctree(LENGTH ~ GRAMREL + CLAUSE, data=s2))plot(qwe <- tree::tree(LENGTH ~ GRAMREL + CLAUSE, data=s2))axis(2); mtext("Deviance", 2, 3)text(qwe, pretty=0, all=TRUE)```But there are cases where trees are not as great at interactions as one might think. As the depth/complexity of a tree grows, the interactions it can reveal become more and more intricate or specialized; it's not uncommon to look at a more complex tree and see 4- or 5-way interactions, which are (i) extremely hard to describe and interpret and (ii) probably often not really explainable (let alone motivatable *a priori*). But sometimes it's even worse: Sometimes, tree-based approaches don't even see what they are supposedly so good at and/or provide quite importance results that are ambiguous and regularly misinterpreted.Consider the following tiny data set of three binary predictors and one binary response (the *s* in the variable names stands for 'small'):```{r}P1s <-factor(rep(c("b","a","b","a"), c(1,7,9,3)) ) # predictor 1 smallP2s <-factor(rep(c("e","f","e"), c(6,10,4))) # predictor 2 smallP3s <-factor(rep(c("m","n","m","n"), c(6,4,6,4))) # predictor 3 smallRs <-factor(rep(c("x","y"), c(10, 10))) # response smallDs <-data.frame(P1s, P2s, P3s, Rs)```The relation between the predictors and the response is such that `P1s` is most associated with the response `Rs`, `P2s` is less associated with `Rs`, and `P3s` is not associated with `Rs` at all:```{r}table(P1s, Rs) # accuracy=0.7table(P2s, Rs) # accuracy=0.6table(P3s, Rs) # accuracy=0.5```However, there's an interaction effect here such that `P2s:P3s` predicts `Rs` perfectly, i.e. with an accuracy of 1 (100%):```{r}ftable( # make a flat contingency table P2s, P3s, # these things in the rows Rs) # this in the columns```Why is this an interaction? Because, for instance, `P3s`: *m* doesn't 'do the same everywhere':* it 'promotes' `R2`: *x* when `P2s` is *e*, but* it 'promotes' `R2`: *y* when `P2s` is *f*.This data set exemplifies something that is at least relatable to the so-called **XOR problem**, "a situation where two variables show no main effect [not true of `P2s` here, though, which has a 'main effect'] but a perfect interaction. In this case, because of the lack of a marginally detectable main effect, none of the variables may be selected in the first split of a classification tree, and the interaction may never be discovered" ([Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982):341). 
And indeed, `tree::tree` doesn't see the perfectly predictive interaction -- instead, it returns a tree with an accuracy of 0.75 that uses the interaction `P1s:P3s` (Note that `rpart::rpart` doesn't fare better, neither does `party(kit)::ctree`, which makes no split at all.)```{r}#| label: fig-hyptreesmall#| fig-cap: A tree on the (small) example data#| fig-height: 4plot(tree_tree <- tree::tree(Rs ~ P1s+P2s+P3s)); grid() # plot the classification treeaxis(2); mtext("Deviance", 2, 3) # add a useful y-axistext(tree_tree, pretty=0, all=TRUE) # add all full labels to ittext( # plot textx=2.5, y=21, # at these coordinatespaste0("accuracy=", # this string, followed by the accuracymean(predict(tree_tree, type="class")==Rs)))```But then, maybe this is because the data set is so small? Let's increase it by one order of magnitude (the *l* in the variable names stands for 'larger') and try again :```{r}#| label: fig-hyptreelarge#| fig-cap: A tree on the (large) example data#| fig-height: 4P1l <-rep(P1s, 10) # predictor 1 largerP2l <-rep(P2s, 10) # predictor 2 largerP3l <-rep(P3s, 10) # predictor 3 largerRl <-rep(Rs, 10) # response largerDl <-data.frame(P1l, P2l, P3l, Rl)plot(tree_tree <- tree::tree(Rl ~ P1l+P2l+P3l)); grid() # plot the classification treeaxis(2); mtext("Deviance", 2, 3) # add a useful y-axistext(tree_tree, pretty=0, all=TRUE) # add all full labels to ittext( # plot textx=5.5, y=125, # at these coordinatespaste0("accuracy=", # this string, followed by the accuracymean(predict(tree_tree, type="class")==Rl)))```Now we get perfect accuracy (also from `rpart::rpart` or `party(kit)::ctree`), but the result is still not great in terms of parsimony: We're not easily shown that `P1l` is not needed at all -- we're essentially presented with a 3-way interaction when a 2-way interaction is all that's needed. The reason is that "the split selection process in regular classification trees is only locally optimal in each node: A variable and cutpoint are chosen with respect to the impurity reduction they can achieve in a given node defined by all *previous splits*, but *regardless of all splits yet to come*" ([Strobl et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982):333, my emphasis). In other words, in all cases so far, the algorithm sees that `P1` has the highest monofactorial association and runs with it, but it never 'backtracks' to see whether starting with a split on the 'monofactorially inferior' `P2` or `P3` would ultimately lead to an overall superior outcome. 
But this is of course where random forests should be able to help, because, in some of the trees of the forest,* the random sampling of the data might have (i) weakened the association of `P1l` to `Rl` and (ii) strengthened the association of `P2l` or `P3l` to `Rl`, maybe giving these monofactorially weaker predictors 'a leg up' to be chosen on the first split;* `P1l` will not be available for the first split, so that `P2l` will be chosen instead, and if then `P3l` is available for the second split, the tree will get perfect accuracy (and such results will augment these variables' importance scores).Indeed, this is what happens:```{r}#| collapse: trueset.seed(sum(utf8ToInt("Knoblauchwürste"))); rf_l <- randomForest::randomForest( Rl ~ P1l+P2l+P3l, mtry=2)mean(predict(rf_l)==Rl) # prediction accuracy```Prediction accuracy goes up to 1, which is great, but look at these results:```{r}# variable importancerandomForest::importance(rf_l)# compute partial dependence scores(pd_1 <-partial( # make pd_1 contain partial dependence scoresobject=rf_l, # from this forestpred.var="P1l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl)) # these were the training data(pd_2 <-partial( # make pd_2 contain partial dependence scoresobject=rf_l, # from this forestpred.var="P2l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl)) # these were the training data(pd_3 <-partial( # make pd_3 contain partial dependence scoresobject=rf_l, # from this forestpred.var="P3l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl)) # these were the training data```In a sense, the variable importance scores and partial dependence scores are still not great: The fact that `P2l:P3l` is what really matters is still nowhere to be seen and even though the variable importance scores for `P3l` are really high, its partial dependence scores are really low. And all that counterintuitive mess in a data set as small and actually simple as this! Does a conditional inference forest perform better?```{r}set.seed(sum(utf8ToInt("Knoblauchwürste"))); cf_l <-cforest( Rl ~ P1l+P2l+P3l, mtry=2)mean(predict(cf_l)==Rl) # prediction accuracyvarimp(cf_l) # variable importancevarimp(cf_l, conditional=TRUE) # variable importance (conditional)(pd_1 <-partial( # make pd_1 contain partial dependence scoresobject=cf_l, # from this forestpred.var="P1l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl)) # these were the training data(pd_2 <-partial( # make pd_2 contain partial dependence scoresobject=cf_l, # from this forestpred.var="P2l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl)) # these were the training data(pd_3 <-partial( # make pd_3 contain partial dependence scoresobject=cf_l, # from this forestpred.var="P3l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl)) # these were the training data```It does a bit better when the conditional variable importance scores are used because those are (nicely) small for `P1l` -- but we still don't get the one piece of information everyone would want, that `P2l:P3l` 'does it all'. 
And we again have the seemingly counterintuitive result that the variable we know does nothing monofactorially -- `P3l` -- has low partial dependence scores (ok) but the by far highest variable importance score (?). So you need to understand the following (which is something I figure many users of forests do actually not know, judging from their discussions of forests -- I dare you find a discussion of forest results in linguistics that makes the following point clear and follows through on its implications): Especially if we're used to regression modeling, what we're getting shown from the variable importance score looks like a main effect, like '`P3l` (on its own) does a lot!', but what the variable importance score *really* means is '`P3l` is (somehow!) important for making good predictions', where *somehow* means `P3l` might* theoretically be a strong main effect (which, here, we know it is not from the simple cross-tabulation at the beginning);* be involved in one or more strong two-way interactions with `P1l` and/or `P2l`, i.e. the 'real' effect of `P3l` would be, or would follow from, what in regression modeling we'd write as `P1l:P3l` and/or `P2l:P3l`;* be involved in a strong three-way interactions with `P1l` and `P2l`, i.e. the 'real' effect `P3l` would be in, or would follow from, `P1l:P2l:P3l`.Thus, a high variable importance score for `P3l` can mean that any one (or more!) of the following things are important: `P3l`, `P1l:P3l`, `P2l:P3l`, or `P1l:P2lP3l`, but you're not told which of these it is. But this fact about variable importance scores is something that most papers using random forests that I've seen seem to have not fully understood. Most papers I've seen arrive at a high variable importance score of some predictor like `P3l` and then interpret the effect of that 'important' variable in a monofactorial main-effect kind of way. Many studies that used random forests will most likely have made a big monofactorial deal out of the predictors for which they obtained a high variable importance score without ever realizing that the real reason for the predictor was its participation in an interaction, and you shouldn't trust the discussion if the authors don't make it clear how they followed up on the variable importance score(s).So how can we address this? As I discuss in Gries (2020 or in SFLWR^3^: Section 7:3), we can address this by following* Forina et al. (2009), who "augment [...] the prediction data with addition of complex combinations of the data" (Kuhn & Johnson 2013:48, also see p. 57f.);* [Strobl et al. (2009:341)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/), who state that "an interaction effect of two variables can also be explicitly added as a potential predictor variable";* Baayen (2011:305f.), whose application of Naïve Discriminative Learning includes "pairs of features" as "the functional equivalent of interactions in a regression model".That means, we force the tree-based approach to consider the combined effect of `P2l` and `P3l` by explicitly including in the formula the combination/interaction of these two predictors or, more generally, all predictors in whose interactions one might be interested in. 
For this data set, it just means we create three new predictors:```{r}P1lxP2l <- P1l:P2l; P1lxP3l <- P1l:P3l; P2lxP3l <- P2l:P3lDl <-cbind(Dl, P1lxP2l, P1lxP3l, P2lxP3l)```And this solves everything for the trees (perfect accuracy and they only use `P2lxP3l`) ...```{r}(tt_l <- tree::tree(Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l))(ct_l <-ctree(Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l))```... and for the random forest ...```{r}set.seed(sum(utf8ToInt("Hanuta"))); rf_l <- randomForest::randomForest( Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l, mtry=2)mean(predict(rf_l)==Rl) # accuracy randomForest::importance(rf_l) # variable importancepartial( # show partial dependence scoresobject=rf_l, # from this forestpred.var="P2lxP3l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl) # these were the training data```... and for the conditional inference forest:```{r}set.seed(sum(utf8ToInt("Lachsschinken"))); cf_l <-cforest( Rl ~ P1l+P2l+P3l+P1lxP2l+P1lxP3l+P2lxP3l, mtry=2)mean(predict(cf_l)==Rl) # accuracyvarimp(cf_l) # variable importancepartial( # show partial dependence scoresobject=cf_l, # from this forestpred.var="P2lxP3l", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=Dl) # these were the training data```Now (i) every tree/forest returns perfect accuracy, (ii) the forests return the by far highest variable importance scores for `P2lxP3l`, and (iii) the partial dependence scores of that predictor reveal the interaction pattern we knew is in the data ... finally! And if you also add the three way interaction `P1lxP2lxP3l` to the formula, then `tree`, `ctree`, and `cforest` all still recover that `P2lxP3l` is most important -- only `randomForest` assigns a variable importance score to `P1lxP2lxP3l` that's slightly higher than that of `P2lxP3l`).Thus, bottom line, trees and especially more robust and powerful forests *can* be good at detecting interactions, but it's nowhere near as simple as is often thought. One should always (i) compute a forest, its accuracy, precision, and recall, but also (ii) variable importance scores and (iii) partial dependence scores. High variable importance scores for a predictor can mean many things: don't take them at face value and let them fool you into automatically assuming they reflect an important 'main effect' and then you plot/interpret that and you're done -- a high variable importance score only means 'that predictor does something, somehow' and you need to do follow-up analyses to find out what exactly that is. And studying that can be done with interaction predictors (or in other ways, see [Gries 2020](https://www.stgries.info/research/2020_STG_TreesForests_CLLT.pdf)). In fact, [Strobl et al. (2009:339)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/) go so far as to advise "In general, however, the complex random forest model, involving high-order interactions and nonlinearity, should be compared to a simpler model (e.g., a linear or logistic regression model including only low-order interactions) whenever possible to decide whether the simpler, interpretable model would be equally adequate". Did you catch that?? Random forests are the *more* complex models, because each tree of a forest can easily involve 3-, 4-, 5-, ... way interactions that people would never dream of fitting in a regression model but that they rely unquestioningly in a forest, no questions asked. 
Don't let the apparent simplicity of the R code for a forest trick you into believing that forests are the simpler method; part of why they do so much better at prediction than regressions is that they can involve a very complex partitioning of the data. Given this underappreciated (by many) complexity of random forests, [Strobl et al. (2009:341)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927982/) conclude that

```
random forests can at least serve as a benchmark predictor: If a linear or other parametric model with a limited number and degree of interaction terms can reach the (cross-validated or test sample) prediction accuracy of the more complex model, the extra complexity may be uncalled for and the simpler, interpretable model should be given preference. If, however, the prediction accuracy cannot be reached with the simpler model, and, for example, the high importance of a variable in a random forest is not reflected by its respective parameters in the simpler model, relevant nonlinear or interaction effects may be missing in the simpler model, and it may not be suited to grasping the complexity of the underlying process.
```

In other words, 'consider doing both!' -- a regression *and* a tree-based method.

# Women and children first? Part 2

Let's revisit the Titanic then because, after having seen notable variable importance scores for `SEX` and `AGE`, the question becomes: what do these values for `SEX` and `AGE` reflect -- main effects of, say, `SEX` ('women first!') and `AGE` ('children first!'), or interactions that `SEX` and `AGE` participate in? Let's find out and include all pairwise interactions. We first add those interaction predictors to the data frame `d`:

```{r}
# add the pairwise interaction predictors
d$CLASSxSEX <- droplevels(d$CLASS:d$SEX) # using droplevels to eliminate
d$CLASSxAGE <- droplevels(d$CLASS:d$AGE) # factor level combinations that
d$SEXxAGE   <- droplevels(d$SEX:d$AGE)   # aren't attested
```

## Fitting a conditional inference forest

So now we fit the next forest, including those interaction predictors:

```{r}
set.seed(sum(utf8ToInt("Kackvogel"))); cf_2 <- cforest(
   SURVIVED ~ CLASS + SEX + AGE +      # w/ these predictors
      CLASSxSEX + CLASSxAGE + SEXxAGE, # w/ these interaction predictors
   data=d,                     # the variables are in d
   ntree=2500,                 # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE), # ... sample w/ replacement
   mtry=3,                     # how many variables are eligible at each split
   cores=12)                   # how many threads to use, adjust as needed (set to 1 on Windoze)
```

## Evaluating the forest's performance

So, how well is this forest doing? With regard to *prediction*, there's probably not going to be much of a change because there are really only very few combinations of variable levels, which the previous forest probably already dealt with ok.

```{r}
d$PRED_PP_y <- predict(     # make d$PRED_PP_y the result of predicting
   cf_2,                    # from cf_2
   type="prob")[,"yes"]     # predicted probabilities of "yes"
d$PRED_CAT <- predict(cf_2) # categorical predictions
(c_m <- table(
   OBS=d$SURVIVED,     # cross-tabulate observed survival outcomes in the rows
   PREDS=d$PRED_CAT))  # predicted survival outcomes in the columns
c( # evaluate the confusion matrix
   "Pred. acc."   =mean(d$SURVIVED==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[ ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",]),
   "Prec. for no" =c_m[ "no","no"]  / sum(c_m[ , "no"]),
   "Rec. for no"  =c_m[ "no","no"]  / sum(c_m[ "no",]))
c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$SURVIVED, d$PRED_PP_y))
```
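Just for transparency -- `cohens.kappa` and `C.score` come from the helper scripts in `_helpers` -- here is a minimal sketch of how a standard Cohen's kappa could be computed directly from the confusion matrix `c_m` we just created (the assumption being that the helper implements the usual observed-vs.-chance-agreement formula):

```{r}
# a minimal sketch (assumption: the cohens.kappa() helper implements the
# standard formula kappa = (p_observed - p_expected) / (1 - p_expected))
p_obs <- sum(diag(c_m)) / sum(c_m)                      # observed agreement
p_exp <- sum(rowSums(c_m) * colSums(c_m)) / sum(c_m)^2  # agreement expected by chance
(p_obs - p_exp) / (1 - p_exp)                           # Cohen's kappa
```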
for no"=c_m[ "no","no"] /sum(c_m[ "no",]))c("Cohen's kappa"=cohens.kappa(c_m)[[1]],"C"=C.score(d$SURVIVED, d$PRED_PP_y))```Indeed the results are the same, but the main focus of including the interaction predictors is not to boost *prediction*, but to try and see whether our *explanation* is improved so we move on to variable importance and then partial dependence:## Variable importance(s)How do the variable importance scores change now that the interaction predictors are included?```{r}sort(varimp(cf_2)) # sort the variable importance scores of cf_2sort(varimp(cf_2, # sort the variable importance scores of cf_2, butconditional=TRUE, # the conditional versioncores=12)) # how many threads to use, adjust as needed (set to 1 on Windoze)```The interactions come out on top.## Partial dependence scoresAnd now let's see what the interactions are doing:```{r}(pd_cs <-partial( # make pd.cs contain partial dependence scoresobject=cf_2, # from this forestpred.var="CLASSxSEX", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=d)) # these were the training data(pd_ca <-partial( # make pd.ca contain partial dependence scoresobject=cf_2, # from this forestpred.var="CLASSxAGE", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=d)) # these were the training data(pd_sa <-partial( # make pd.sa contain partial dependence scoresobject=cf_2, # from this forestpred.var="SEXxAGE", # for this predictorwhich.class=2, # for the 2nd level of the responseprob=TRUE, # return results on the probability scaletrain=d)) # these were the training data```This is easier to interpret when broken up a bit differently and maybe when visualized; here's the example of `CLASS:SEX`:```{r}#| echo: false#| label: fig-titanic_cs#| fig-cap: The interaction of CLASS:SEX#| fig-width: 7tapply(pd_cs$yhat, list(gsub(":.*$", "", pd_cs$CLASSxSEX), gsub("^.*?:", "", pd_cs$CLASSxSEX)), mean)tab_cs <-table(d$CLASSxSEX) # determine the frequencies of the conjunctionsqwe <-barplot( # make a bar plotmain="Partial dep. of SURVIVED on CLASSxSEX", # w/ this headingcol=rep(c("#FF000040", "#0000FF40"), 4), # pretty barsheight=pd_cs$yhat, # whose heights are the PDP scoresxlab="The crossing of CLASS and SEX", # x-axis labelylab="Partial dep. score (for 'survived')", # y-axis labelylim=c(0, 1), # y-axis limits# make the widths of the bars proportional to the proportions of CLASSxSEXwidth=sqrt(prop.table(tab_cs))); grid(); abline(h=0.5, lty=2, col="darkgrey") # add a gridtext(x=qwe,y=0.5*pd_cs$yhat,labels=substr(pd_cs$CLASSxSEX, 1, 6),srt=90, col=rep(c("#FF000080", "#0000FF80"), 4))```For `CLASS:SEX`, there is obviously some sort of interaction effect: We knew from `pd_s` that the chances of women surviving is generally much higher than that of men, but we now see that this effect doesn't hold everywhere in the same way. In other words, being female doesn't help you everywhere/always equally (see 3rd class).Here are the corresponding plots for the other two interactions:```{r}#| echo: false#| label: fig-titanic_ca#| fig-cap: The interaction of CLASS:AGE#| fig-width: 7tab_ca <-table(d$CLASSxAGE) # determine the frequencies of the conjunctionsqwe <-barplot( # make a bar plotmain="Partial dep. 
of SURVIVED on CLASSxAGE", # w/ this headingcol=rep(c("#8B451380", "#D1179C40"), 4), # pretty barsheight=pd_ca$yhat, # whose heights are the PDP scoresxlab="The crossing of CLASS and AGE", # x-axis labelylab="Partial dep. score (for 'survived')", # y-axis labelylim=c(0, 1), # y-axis limits# make the widths of the bars proportional to the proportions of CLASSxAGEwidth=sqrt(prop.table(tab_ca))); grid(); abline(h=0.5, lty=2, col="darkgrey") # add a gridtext(x=qwe,y=0.5*pd_ca$yhat,labels=substr(pd_ca$CLASSxAGE, 1, 6),srt=90, col=rep(c("#8B451380", "#D1179C80"), 4))``````{r}#| echo: false#| label: fig-titanic_sa#| fig-cap: The interaction of SEX:AGEtab_sa <-table(d$SEXxAGE) # determine the frequencies of the conjunctionsqwe <-barplot( # make a bar plotmain="Partial dep. of SURVIVED on SEXxAGE", # w/ this headingcol=c("#FF000080", "#FF000040", "#0000FF80", "#0000FF40"), # pretty barsheight=pd_sa$yhat, # whose heights are the PDP scoresxlab="The crossing of SEX and AGE", # x-axis labelylab="Partial dep. score (for 'survived')", # y-axis labelylim=c(0, 1), # y-axis limits# make the widths of the bars proportional to the proportions of SEXxAGEwidth=sqrt(prop.table(tab_sa))); grid() # add a gridtext(x=qwe,y=0.2,labels=pd_sa$SEXxAGE,srt=90, col=c("#FF000080", "#FF000080", "#0000FF80", "#0000FF80"))```For `CLASS:AGE`, there is some sort of interaction effect. We knew from `pd_a` that the chances of children to survive were generally much higher than that of adults, but we now see that this effect is restricted to first and second class and much less strong in 3rd class. Finally `SEX:AGE`: We knew from `pd_a` that the chances of children to survive were generally much higher than that of adults and we knew from `pd_s` that the chances of females to survive were much higher than that of males. If both these effects were perfectly additive, we'd see that female children have the highest scores for survival, but they don't: female adults do, but among females the difference between children and adults is actually very minor. However, among males, the difference between children and adults is considerably bigger.I hope this makes clear that, as discussed earlier, you do *not* want to take variable importance scores and partial dependence scores for predictors as indications of them being important just as main effects. Without the more fine-grained analysis here, our understanding of the effects would be *much* less complete; check the exercises for SFLWR^3^: Chapter 7, for a bit (including plots for the interaction) on this example.# Revisiting the subject realization dataLet us now review the subject realization data from the regression modeling part:```{r}rm(list=ls(all=TRUE)); invisible(gc())invisible(sapply(dir("_helpers", full.names=TRUE), source))summary(d <-read.delim(file="_input/subjreal.csv",stringsAsFactors=TRUE))d <-droplevels(d[complete.cases(d),])```Let's fit a conditional inference forest ...```{r subjrealCIT}set.seed(sum(utf8ToInt("Kackvogel"))); cf_1 <- cforest( SUBJREAL ~ # this response variable ~ CONTRAST + GIVENNESS, # w/ these predictors data=d, # the variables are in d ntree=2500, # how many trees to fit/how many data sets to ... perturb=list(replace=TRUE), # ... sample w/ replacement mtry=3, # how many variables are eligible at each split cores=12) # how many threads to use, adjust as needed (set to 1 on Windoze)```... and let's get the predictions:```{r subjrealPRED}# much faster way to compute the predictions:qwe <- predict(cf_1, type="prob") # all pred. probs. 
Let's fit a conditional inference forest ...

```{r subjrealCIT}
set.seed(sum(utf8ToInt("Kackvogel"))); cf_1 <- cforest(
   SUBJREAL ~                  # this response variable ~
      CONTRAST + GIVENNESS,    # w/ these predictors
   data=d,                     # the variables are in d
   ntree=2500,                 # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE), # ... sample w/ replacement
   mtry=3,                     # how many variables are eligible at each split
   cores=12)                   # how many threads to use, adjust as needed (set to 1 on Windoze)
```

... and let's get the predictions:

```{r subjrealPRED}
# much faster way to compute the predictions:
qwe <- predict(cf_1, type="prob")  # all pred. probs. from cf_1
d$PRED_PP_y <- qwe[,"yes"]         # predicted probabilities of "yes"
d$PRED_CAT  <- levels(d$SUBJREAL)[apply(qwe, 1, which.max)] # categorical predictions
rm(qwe)
(c_m <- table(OBS=d$SUBJREAL,  # cross-tabulate obs. subject realizations in the rows
   PREDS=d$PRED_CAT))          # predicted subject realizations in the columns
c( # evaluate the confusion matrix
   "Pred. acc."   =mean(d$SUBJREAL==d$PRED_CAT),
   "Prec. for yes"=c_m["yes","yes"] / sum(c_m[ ,"yes"]),
   "Rec. for yes" =c_m["yes","yes"] / sum(c_m["yes",]),
   "Prec. for no" =c_m[ "no","no"]  / sum(c_m[ , "no"]),
   "Rec. for no"  =c_m[ "no","no"]  / sum(c_m[ "no",]))
c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$SUBJREAL, d$PRED_PP_y))
```

Now, we have a pretty good understanding of this data set from the regression model. We *know* there's a pretty strong interaction there -- what do the variable importance scores say?

```{r subjrealVARIMP}
sort(varimp(cf_1))    # sort the variable importance scores of cf_1
sort(varimp(cf_1,     # sort the variable importance scores of cf_1, but
   conditional=TRUE,  # the conditional version
   cores=12))         # how many threads to use, adjust as needed (set to 1 on Windoze)
```

Obviously, this again does not address the issue of the interaction ... But here's a way to deal with interactions of a numeric and a categorical predictor:

```{r subjrealPDP}
pd_cg <- partial(  # make pd_cg contain partial dependence scores
   object=cf_1,    # from this forest
   pred.var=c("GIVENNESS", "CONTRAST"), # for these predictors <-----------------
   which.class=2,  # for the 2nd level of the response
   prob=TRUE,      # return results on the probability scale
   train=d)        # these were the training data
```

Let's change the output format to something slightly easier to look at and also plot this to see what it means:

```{r}
(qwe <- tapply(pd_cg$yhat, list(pd_cg$GIVENNESS, pd_cg$CONTRAST), mean))
```

```{r}
#| echo: false
#| label: fig-subjreal_cg
#| fig-cap: The interaction of CONTRAST:GIVENNESS
# a table for font sizes:
tab_cg <- log10(table(d$GIVENNESS, d$CONTRAST)+1)
plot(pch="n", type="b", cex=tab_cg[,"no"],
   xlab="Givenness",            xlim=c(0, 10), x=0:10,
   ylab="Pred. prob. of 'yes'", ylim=c(0, 1 ), y=qwe[,"no"]); grid()
lines(pch="y", type="b", cex=tab_cg[,"yes"],
   x=0:10, y=qwe[,"yes"])
```

Now we're talking: unlike the variable importance scores, the two-predictor partial dependence scores recover the interaction of CONTRAST and GIVENNESS that we already know is in the data ...

```{r}
sessionInfo()
```