Ling 204: session 09

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

04 Mar 2026 11:34:56

1 Session 09: Random forests

We are dealing with the same data set as in session 1; as a reminder, the data are in _input/genitives.csv and you can find information about the variables/columns in _input/genitives.r.

rm(list=ls(all.names=TRUE))
library(magrittr); library(randomForest); library(pdp); library(STGmisc) # STGmisc for cohens.kappa & C.score below
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE      GENITIVE  SPEAKER       MODALITY      POR_LENGTH
 Min.   :   2   of:2720   nns:2666   spoken :1685   Min.   :  1.00
 1st Qu.:1006   s : 880   ns : 934   written:1915   1st Qu.:  8.00
 Median :2018                                       Median : 11.00
 Mean   :2012                                       Mean   : 14.58
 3rd Qu.:3017                                       3rd Qu.: 17.00
 Max.   :4040                                       Max.   :204.00
   PUM_LENGTH         POR_ANIMACY   POR_FINAL_SIB        POR_DEF
 Min.   :  2.00   animate   : 920   absent :2721   definite  :2349
 1st Qu.:  6.00   collective: 607   present: 879   indefinite:1251
 Median :  9.00   inanimate :1671
 Mean   : 10.35   locative  : 243
 3rd Qu.: 13.00   temporal  : 159
 Max.   :109.00                                                     

We are again asking, does the choice of a genitive construction (of vs. s) vary as a function of

  • all the predictors that are already part of the data frame, i.e.
    • the categorical predictors SPEAKER, MODALITY, POR_ANIMACY, POR_FINAL_SIB, POR_DEF;
    • the numeric predictors POR_LENGTH and PUM_LENGTH;
  • an additional new length-based predictor, namely how the length of the possessor (POR_LENGTH) compares to the length of the possessum (PUM_LENGTH), expressed as a difference; since such a comparison variable doesn’t exist in our data set yet, we need to create it first.

However, this session we will tackle this question using random forests.

1.1 Exploration & preparation

Some exploration of the relevant variables:

table(d$GENITIVE, d$SPEAKER)       %>% addmargins

       nns   ns  Sum
  of  2024  696 2720
  s    642  238  880
  Sum 2666  934 3600
table(d$GENITIVE, d$MODALITY)      %>% addmargins

      spoken written  Sum
  of    1232    1488 2720
  s      453     427  880
  Sum   1685    1915 3600
table(d$GENITIVE, d$POR_ANIMACY)   %>% addmargins

      animate collective inanimate locative temporal  Sum
  of      370        408      1638      199      105 2720
  s       550        199        33       44       54  880
  Sum     920        607      1671      243      159 3600
table(d$GENITIVE, d$POR_FINAL_SIB) %>% addmargins

      absent present  Sum
  of    1962     758 2720
  s      759     121  880
  Sum   2721     879 3600
table(d$GENITIVE, d$POR_DEF)       %>% addmargins

      definite indefinite  Sum
  of      1609       1111 2720
  s        740        140  880
  Sum     2349       1251 3600
hist(d$LEN_PORmPUM_LOG <- "-"(
   log2(d$POR_LENGTH),
   log2(d$PUM_LENGTH)),
   main="", xlab="LEN_PORmPUM_LOG")

1.2 Deviance & baseline(s)

Let’s compute all baseline values for what will be the response variable, GENITIVE, right away:

(nulls <- setNames(
   c(d$GENITIVE %>% table %>% prop.table %>% max,
     d$GENITIVE %>% table %>% prop.table %>% "^"(2) %>% sum,
     deviance(glm(GENITIVE ~ 1, family=binomial, data=d)),
     logLik(glm(GENITIVE ~ 1, family=binomial, data=d))),
   c("no-info rate", "proportional guessing", "null deviance", "null logLik")))
         no-info rate proportional guessing         null deviance
            0.7555556             0.6306173          4004.2729923
          null logLik
        -2002.1364962 

1.3 Random forests

1.3.1 Creating & evaluating a random forest

How about we fit a ‘regular’ random forest?

set.seed(sum(utf8ToInt("Rotzlöffel")))
(rf_1 <- randomForest(GENITIVE ~
   # categorical predictors
   SPEAKER + MODALITY + POR_ANIMACY + POR_FINAL_SIB + POR_DEF +
   # numeric predictors
   POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG,
   data=d,           # the data are in d
   ntree=1500,       # how many trees to fit/how many data sets to ...
   replace=TRUE,     # ... sample w/ replacement
   mtry=3,           # how many variables are eligible at each split
   keep.forest=TRUE, # retain the forest
   keep.inbag=TRUE,  # retain which points were in-bag (i.e. not OOB)
   importance=TRUE)) # compute importance scores

Call:
 randomForest(formula = GENITIVE ~ SPEAKER + MODALITY + POR_ANIMACY +      POR_FINAL_SIB + POR_DEF + POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG,      data = d, ntree = 1500, replace = TRUE, mtry = 3, keep.forest = TRUE,      keep.inbag = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 1500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 14.08%
Confusion matrix:
     of   s class.error
of 2480 240  0.08823529
s   267 613  0.30340909

How well does the forest do? Let’s compute numeric predictions (which, as in binary regression modeling and with trees before, we restrict to the second level of the response) and categorical predictions:

d$PRED_PP_2 <- predict( # make d$PRED_PP_2 the result of predicting
   rf_1,                # from rf_1
   type="prob")[,"s"]   # predicted probabilities of "s"
d$PRED_CAT <- predict(rf_1) # categorical predictions
(c_m <- table(         # confusion matrix: cross-tabulate
   OBS  =d$GENITIVE,   # observed orders in the rows
   PREDS=d$PRED_CAT))  # predicted orders in the columns
    PREDS
OBS    of    s
  of 2480  240
  s   267  613

Let’s evaluate this confusion matrix with accuracy, precision(s), and recall(s):

c( # precisions & accuracies/recalls
   "Class. acc."=mean(d$GENITIVE==d$PRED_CAT, na.rm=TRUE),
   "Prec. for s"     =c_m["s","s"]   / sum(c_m[       ,"s"]),
   "Acc./rec. for s" =c_m["s","s"]   / sum(c_m["s",]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[       ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",]))
     Class. acc.      Prec. for s  Acc./rec. for s     Prec. for of
       0.8591667        0.7186401        0.6965909        0.9028031
Acc./rec. for of
       0.9117647 

We also compute Cohen’s κ and the C-score:

c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(cv.f=d$GENITIVE, d$PRED_PP_2))
Cohen's kappa             C
    0.6147351     0.9145776 
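As a sanity check, both scores can be computed by hand: Cohen’s κ from the confusion matrix printed above, and the C-score as the proportion of concordant (observed-s, observed-of) pairs; to keep it small, the C part below uses toy values rather than our actual 2720×880 pairs:

```r
# Cohen's kappa from the confusion matrix above:
cm  <- matrix(c(2480, 267, 240, 613), nrow=2,
              dimnames=list(OBS=c("of","s"), PREDS=c("of","s")))
p_o <- sum(diag(cm)) / sum(cm)                    # observed agreement rate
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2 # agreement expected by chance
(p_o - p_e) / (1 - p_e)                           # kappa: ~0.6147

# the C-score as pairwise concordance (toy predictions, not our data):
obs   <- factor(c("of", "of", "s", "s", "of", "s"))
preds <- c(0.2, 0.4, 0.9, 0.7, 0.6, 0.3) # predicted probabilities of "s"
prs   <- expand.grid(s=which(obs=="s"), of=which(obs=="of")) # all s-of pairs
mean((preds[prs$s] > preds[prs$of]) +     # concordant pairs count 1, ...
     0.5*(preds[prs$s]==preds[prs$of]))   # ... ties count 0.5: here 7/9
```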

Not bad … But what about an R-squared? Usually, people do not report R-squareds for trees or forests, but I think those can be heuristically useful. Thus, let’s add a column for the deviance calculation:

d$PREDS_PP_OBS <- abs("-"( # the absolute value of the difference
   d$PRED_PP_2,            # the predicted prob. of the 2nd level
   d$GENITIVE!="s"))       # 0 when obs=2nd level, 1 when obs=1st level
d$CONTRIBS2DEV <- -log(d$PREDS_PP_OBS)

So, what is the deviance of the forest?

2*sum(d$CONTRIBS2DEV)
[1] Inf

Oops, why is that? Check summary(d$PREDS_PP_OBS) for the answer.
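The problem can be seen in miniature (with toy numbers, not our data): a predicted probability of exactly 0 for an observed outcome has an infinite deviance contribution, and a single such case makes the whole sum infinite:

```r
# -log of a predicted probability of 0 is infinite ...
-log(0)                       # Inf
# ... so one such case is enough to make the entire deviance infinite:
2 * sum(-log(c(0.9, 0.8, 0))) # Inf
```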

How do we deal with this? By pulling the extreme predicted probabilities away from 0 and 1: we add to the 0s, and subtract from the 1s, half of the smallest difference between adjacent (sorted) predicted probabilities:

offset <- unique(d$PREDS_PP_OBS) %>% sort %>% diff %>% min %>% "/"(2)
d$PREDS_PP_OBS[d$PREDS_PP_OBS==0] <- offset
d$PREDS_PP_OBS[d$PREDS_PP_OBS==1] <- 1-offset
# d$PREDS_PP_OBS <- min2max(d$PREDS_PP_OBS, minim=offset, maxim=1-offset)

Now this should work …

d$CONTRIBS2DEV <- -log(d$PREDS_PP_OBS)
2*sum(d$CONTRIBS2DEV)
[1] 2793.659

Thus, we can compute McFadden’s R-squared:

(nulls[3] - 2*sum(d$CONTRIBS2DEV)) / nulls[3]
null deviance
    0.3023306 

I leave it up to you to also compute Nagelkerke’s R-squared from the null model’s/forest’s log likelihood and this forest’s log likelihood.
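Should you want to check your result, here is a sketch of the general recipe (the helper function `nagelkerke` is mine; the plugged-in values come from the output above, with the forest’s log likelihood taken as minus half its deviance):

```r
# Nagelkerke's R-squared from two log likelihoods (sketch): rescale
# Cox & Snell's R-squared by the maximum it could possibly attain:
nagelkerke <- function(ll_null, ll_model, n) {
   r2_cs  <- 1 - exp((2/n) * (ll_null - ll_model)) # Cox & Snell R-squared
   r2_max <- 1 - exp((2/n) * ll_null)              # its highest possible value
   r2_cs / r2_max }
# with the values from above: null logLik, -deviance/2, & n=3600:
nagelkerke(-2002.1364962, -2793.659/2, 3600) # ~0.425
```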

1.3.2 Interpreting the random forest

Which variables are most important for the predictions?

varimps <- importance(rf_1)
plot(pch=16, col="#00000040",
   xlab="Mean decr accuracy", xlim=c(0, 350), x=varimps[,3],
   ylab="Mean decr Gini"    , ylim=c(0, 450), y=varimps[,4])
grid(); lines(par("usr")[1:2], par("usr")[3:4], lty=3)
text(varimps[,3], varimps[,4], rownames(varimps))
Figure 1: Variable importance scores for the random forest

And what are some predictors’ (directions of) effects? We check the most powerful categorical predictor POR_ANIMACY with the default plot for partial dependency scores in Figure 2:

(pd_a <- partial(          # make pd_a contain partial dependence scores
   object=rf_1,            # from this forest
   pred.var="POR_ANIMACY", # for this predictor
   prob=TRUE,              # return predicted probabilities
   which.class=2,          # for the 2nd level of the response
   train=d))               # these were the training data
  POR_ANIMACY       yhat
1     animate 0.55968056
2  collective 0.26962148
3   inanimate 0.04046815
4    locative 0.13424370
5    temporal 0.28097000
plot(pd_a, ylab="How much a level pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2.5, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 2: Partial dependency scores for POR_ANIMACY (plot 1)

… and a maybe more informative version in Figure 3:

tab_a <- table(POR_ANIMACY=d$POR_ANIMACY) # determine the frequencies of the animacy levels
barplot(main="Partial dep. of GENITIVE on POR_ANIMACY", # make a bar plot w/ this heading
   col="darkgrey",                              # grey bars
   height=pd_a$yhat,                            # whose heights are the PDP scores
   xlab="Animacy of possessor",                 # x-axis label
   ylab="Partial dependence score (for s-gen)", # y-axis label
   ylim=c(0, 1),                                # y-axis limits
   # look up ?abbreviate, which I use to make the names fit:
   names.arg=abbreviate(pd_a$POR_ANIMACY, 4),   # label the bars like this
   # make the widths of the bars represent the proportions of POR_ANIMACY
   width=prop.table(tab_a))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(0.6, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(0.8, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 3: Partial dependency scores for POR_ANIMACY (plot 2)

Then we do the same for one numeric predictor, LEN_PORmPUM_LOG: a default version …

pd_l <- partial(               # make pd_l contain partial dependence scores
   object=rf_1,                # from this forest
   pred.var="LEN_PORmPUM_LOG", # for this predictor
   grid.resolution=10,         # provide estimates for this many predictor values
   prob=TRUE,                  # return predicted probabilities
   which.class=2,              # for the 2nd level of the response
   train=d)                    # these were the training data
pd_l # here's what we just created
   LEN_PORmPUM_LOG      yhat
1       -4.5235620 0.4292809
2       -3.4971389 0.4303872
3       -2.4707159 0.4442854
4       -1.4442928 0.3791333
5       -0.4178697 0.2756846
6        0.6085533 0.2542856
7        1.6349764 0.2122522
8        2.6613994 0.1956402
9        3.6878225 0.1956093
10       4.7142455 0.1956093
plot(pd_l, ylab="How much a value pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2.5, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 4: Partial dependency scores for LEN_PORmPUM_LOG (plot 1)

… and a better one:

pd_l$FREQ <- table(cut(d$LEN_PORmPUM_LOG,
   breaks=c(-5, (pd_l$LEN_PORmPUM_LOG[1:9] + pd_l$LEN_PORmPUM_LOG[2:10]) / 2, 5),
   include.lowest=TRUE))
pd_l$PROP <- prop.table(pd_l$FREQ)
plot(main="Partial dep. of GENITIVE on LEN_PORmPUM_LOG", # make a plot w/ this heading
   type="b", pch=16,                        # lines & points (filled circles)
   xlab="Possessor minus possessum length", # x-axis label
   ylab="Partial dep. score (for s-gen)",   # y-axis label
   ylim=c(0, 1),                            # y-axis limits
   x=pd_l$LEN_PORmPUM_LOG,                  # x-coords: the length differences
   y=pd_l$yhat,                             # y-coords: PDP scores
   cex=0.5+pd_l$PROP*10) # make the point sizes represent the frequencies of the length differences
   abline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0.1, 0.9, 0.1)), col="grey", lty=3) # x-axis decile grid
   abline(h=seq(0, 1, 0.25), lty=3, col="#00000020") # y-axis grid
lines(lowess(pd_l$yhat ~ pd_l$LEN_PORmPUM_LOG), lwd=6, col="#BCBCBC80") # add a smoother
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 5: Partial dependency scores for LEN_PORmPUM_LOG (plot 2)

Test question for you: why might this smoother be problematic in principle (here, as it happens, it is fine), and how might one fix that?

Of course, one might ideally also check for interactions …
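With pdp, one would pass two predictors at once, e.g. partial(rf_1, pred.var=c("POR_ANIMACY", "LEN_PORmPUM_LOG")). The underlying logic can be sketched in base R with a toy glm standing in for the forest (hypothetical data, not ours): for every cell of a grid over two predictors’ values, fix both predictors and average the model’s predictions over the data; clearly non-parallel profiles then suggest an interaction:

```r
set.seed(1) # toy data: a binary response & two predictors
toy <- data.frame(y =factor(sample(c("a","b"), 200, replace=TRUE)),
                  x1=factor(sample(c("lo","hi"), 200, replace=TRUE)),
                  x2=rnorm(200))
m <- glm(y ~ x1*x2, family=binomial, data=toy) # stand-in for the forest
pd_grid <- expand.grid(                 # all combinations of ...
   x1=levels(toy$x1),                   # ... the levels of x1 & ...
   x2=quantile(toy$x2, c(0.1, 0.5, 0.9))) # ... 3 quantiles of x2
pd_grid$yhat <- NA
for (r in seq_len(nrow(pd_grid))) {     # for every grid cell, ...
   tmp <- toy                           # ... take the data, ...
   tmp$x1 <- pd_grid$x1[r]              # ... fix both ...
   tmp$x2 <- pd_grid$x2[r]              # ... predictors, ...
   pd_grid$yhat[r] <- mean(             # ... & average the predictions
      predict(m, newdata=tmp, type="response")) }
round(xtabs(yhat ~ x1 + x2, data=pd_grid), 3) # non-parallel rows: interaction
```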

1.3.3 Write-up

To determine whether the choice of a genitive construction (of vs. s) varies as a function of

  • the categorical predictors SPEAKER, MODALITY, POR_ANIMACY, POR_FINAL_SIB, POR_DEF;
  • the numeric predictors POR_LENGTH and PUM_LENGTH and the logged length difference between them (LEN_PORmPUM_LOG),

a random forest was fit with the following hyperparameters: ntree was set to 1500 and mtry to 3. The results indicate that the genitive alternation can be predicted fairly well: while the no-information rate is 75.56% (for of), the overall accuracy of the forest is 85.9%, with separate precision and recall values for each genitive as follows: [quote precision and recall values]. The forest came with a reduction in deviance of the response of (a McFadden’s R2 of) 0.302.

Variable importance scores (the combination of mean decrease of Gini and of accuracy) indicate that two main kinds of predictors play the most important roles: the animacy of the possessor and values related to the lengths of the possessor and the possessum; of the latter, the logged length difference was most important. [show plot] The effect of POR_ANIMACY is that

  • there is a very strong preference for of-genitives with inanimate possessors;
  • there are weak preferences for of-genitives with collective, locative, and temporal possessors;
  • there is a weak preference for s-genitives with animate possessors.

The effect of the length difference variable boils down to a simple (and expected) short-before-long effect.

1.4 Conditional inference forests

Let’s also see whether a conditional inference forest leads to different results (not very likely, but possible, esp. with regard to variable importances):

d <- d[,1:10] # drop the prediction columns (i.e. keep what we loaded plus LEN_PORmPUM_LOG)
detach(package:randomForest); library(partykit)

1.4.1 Creating & evaluating a conditional inference forest

Here’s how we fit a conditional inference forest; in general the syntax is very similar, but there are a few tiny changes:

set.seed(sum(utf8ToInt("Gummibären"))) # set a replicable random number seed
cf_1 <- cforest(GENITIVE ~
   # categorical predictors
   SPEAKER + MODALITY + POR_ANIMACY + POR_FINAL_SIB + POR_DEF +
   # numeric predictors
   POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG,
   data=d,
   ntree=1500, # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE), # ... sample w/ replacement
   mtry=3)     # how many variables are eligible at each split

We’ll proceed exactly as before: we first compute numeric and categorical predictions, then we evaluate the resulting confusion matrix:

d$PRED_PP_2 <- predict( # make d$PRED_PP_2 the result of predicting
   cf_1,                # from cf_1
   type="prob")[,"s"]   # predicted probabilities of "s"
d$PRED_CAT <- predict(cf_1) # categorical predictions (from cf_1, not rf_1)
(c_m <- table(         # confusion matrix: cross-tabulate
   OBS  =d$GENITIVE,   # observed orders in the rows
   PREDS=d$PRED_CAT))  # predicted orders in the columns
    PREDS
OBS    of    s
  of 2480  240
  s   267  613
c( # precisions & accuracies/recalls
   "Class. acc."=mean(d$GENITIVE==d$PRED_CAT, na.rm=TRUE),
   "Prec. for s"     =c_m["s","s"]   / sum(c_m[       ,"s"]),
   "Acc./rec. for s" =c_m["s","s"]   / sum(c_m["s",]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[       ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",]))
     Class. acc.      Prec. for s  Acc./rec. for s     Prec. for of
       0.8591667        0.7186401        0.6965909        0.9028031
Acc./rec. for of
       0.9117647 
c("Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(cv.f=d$GENITIVE, d$PRED_PP_2))
Cohen's kappa             C
    0.6147351     0.9601504 
d$PREDS_PP_OBS <- abs("-"( # the absolute value of the difference
   d$PRED_PP_2,            # the predicted prob. of the 2nd level
   d$GENITIVE!="s"))       # 0 when obs=2nd level, 1 when obs=1st level
# calculate the offset and ...
offset <- unique(d$PREDS_PP_OBS) %>% sort %>% diff %>% min %>% "/"(2)
# apply it to the extreme values:
d$PREDS_PP_OBS[d$PREDS_PP_OBS==0] <- offset
d$PREDS_PP_OBS[d$PREDS_PP_OBS==1] <- 1-offset
# compute contributions to deviance and ...
d$CONTRIBS2DEV <- -log(d$PREDS_PP_OBS)
# the deviance
2*sum(d$CONTRIBS2DEV)
[1] 1693.867
# which leads to this McFadden R-squared:
(nulls[3] - 2*sum(d$CONTRIBS2DEV)) / nulls[3]
null deviance
    0.5769852 
# save because partykit is slow
save(d, cf_1, file="_output/204_09_cf1.rds")

1.4.2 Interpreting the conditional inference forest

Which variables are most important for the predictions?

(varimps_reg  <- sort(varimp(cf_1, cores=10))) # on Windoze, delete "cores=10"
       MODALITY         POR_DEF         SPEAKER   POR_FINAL_SIB      PUM_LENGTH
     0.04713185      0.06167438      0.06192974      0.06386651      0.07526681
     POR_LENGTH LEN_PORmPUM_LOG     POR_ANIMACY
     0.22779484      0.24396430      1.43926077 
# (varimps_cond <- sort(varimp(cf_1, cores=10, conditional=TRUE)))

The standard variable importance scores are very similar to those we saw from randomForest: POR_ANIMACY and the length-based predictors, led by LEN_PORmPUM_LOG, are the strongest predictors. The conditional variable importance scores cannot be computed, at least not on the hardware I have available (a fast AMD Ryzen 9 3900 with 24 threads and 64 GB of RAM): after more than 12 hours, the function was still running, using 80 GB of RAM/swap memory.

What are these two predictors’ (directions of) effects? Most likely, exactly the same as before. Here’s the result for the categorical predictor POR_ANIMACY with the default plot …

(pd_a <- partial(          # make pd_a contain partial dependence scores
   object=cf_1,            # from this forest
   pred.var="POR_ANIMACY", # for this predictor
   prob=TRUE,              # return predicted probabilities
   which.class=2,          # for the 2nd level of the response
   train=d))               # these were the training data
  POR_ANIMACY       yhat
1     animate 0.53898619
2  collective 0.26978057
3   inanimate 0.04570549
4    locative 0.13798152
5    temporal 0.27685887
plot(pd_a, ylab="How much a level pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2.5, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 6: Partial dependency scores for POR_ANIMACY (plot 1)

… and a more informative version:

tab_a <- table(POR_ANIMACY=d$POR_ANIMACY) # determine the frequencies of the animacy levels
barplot(main="Partial dep. of GENITIVE on POR_ANIMACY", # make a bar plot w/ this heading
   col="darkgrey",                              # grey bars
   height=pd_a$yhat,                            # whose heights are the PDP scores
   xlab="Animacy of possessor",                 # x-axis label
   ylab="Partial dependence score (for s-gen)", # y-axis label
   ylim=c(0, 1),                                # y-axis limits
   # look up ?abbreviate, which I use to make the names fit:
   names.arg=abbreviate(pd_a$POR_ANIMACY, 4),   # label the bars like this
   # make the widths of the bars represent the proportions of POR_ANIMACY
   width=prop.table(tab_a))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(0.6, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(0.8, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 7: Partial dependency scores for POR_ANIMACY (plot 2)

On the whole, these don’t look very different from the ones generated with the forest from randomForest.

Now we do the same for one numeric predictor, LEN_PORmPUM_LOG: a default version …

pd_l <- partial(               # make pd_l contain partial dependence scores
   object=cf_1,                # from this forest
   pred.var="LEN_PORmPUM_LOG", # for this predictor
   grid.resolution=10,         # provide estimates for this many predictor values
   prob=TRUE,                  # return predicted probabilities
   which.class=2,              # for the 2nd level of the response
   train=d)                    # these were the training data
pd_l # here's what we just created
plot(pd_l, ylab="How much a value pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2.5, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)

… and a better one:

pd_l$FREQ <- table(cut(d$LEN_PORmPUM_LOG,
   breaks=c(-5, (pd_l$LEN_PORmPUM_LOG[1:9] + pd_l$LEN_PORmPUM_LOG[2:10]) / 2, 5),
   include.lowest=TRUE))
pd_l$PROP <- prop.table(pd_l$FREQ)
plot(main="Partial dep. of GENITIVE on LEN_PORmPUM_LOG", # make a plot w/ this heading
   type="b", pch=16,                        # lines & points (filled circles)
   xlab="Possessor minus possessum length", # x-axis label
   ylab="Partial dep. score (for s-gen)",   # y-axis label
   ylim=c(0, 1),                            # y-axis limits
   x=pd_l$LEN_PORmPUM_LOG,                  # x-coords: the length differences
   y=pd_l$yhat,                             # y-coords: PDP scores
   cex=0.5+pd_l$PROP*10) # make the point sizes represent the frequencies of the length differences
   abline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0.1, 0.9, 0.1)), col="grey", lty=3) # x-axis decile grid
   abline(h=seq(0, 1, 0.25), lty=3, col="#00000020") # y-axis grid
lines(lowess(pd_l$yhat ~ pd_l$LEN_PORmPUM_LOG), lwd=6, col="#BCBCBC80") # add a smoother
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2, 0.5, "cut-off"          , cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 8: Partial dependency scores for LEN_PORmPUM_LOG (plot 2)

Same thing: not much of a difference from the previous forest, which is why I am skipping the write-up section. (Also, this smoother suffers from the same theoretical problem as the previous one, even though fixing it again makes no difference for these data.)

2 Session info

# save again because partykit is slow
save(d, cf_1, file="_output/204_09_cf1.rds")
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  compiler
[8] methods   base

other attached packages:
[1] partykit_1.2-24 mvtnorm_1.3-3   libcoin_1.0-10  pdp_0.8.2
[5] STGmisc_1.04    Rcpp_1.1.1      magrittr_2.0.4

loaded via a namespace (and not attached):
 [1] Matrix_1.7-4         randomForest_4.7-1.2 gtable_0.3.6
 [4] jsonlite_2.0.0       dplyr_1.1.4          rpart_4.1.24
 [7] tidyselect_1.2.1     parallel_4.5.2       splines_4.5.2
[10] scales_1.4.0         yaml_2.3.12          fastmap_1.2.0
[13] lattice_0.22-7       ggplot2_4.0.1        R6_2.6.1
[16] generics_0.1.4       Formula_1.2-5        knitr_1.51
[19] iterators_1.0.14     htmlwidgets_1.6.4    MASS_7.3-65
[22] inum_1.0-5           tibble_3.3.1         pillar_1.11.1
[25] RColorBrewer_1.1-3   rlang_1.1.7          xfun_0.55
[28] S7_0.2.1             otel_0.2.0           cli_3.6.5
[31] digest_0.6.39        foreach_1.5.2        rstudioapi_0.18.0
[34] lifecycle_1.0.5      vctrs_0.7.0          evaluate_1.0.5
[37] glue_1.8.0           farver_2.1.2         codetools_0.2-20
[40] survival_3.8-6       rmarkdown_2.30       tools_4.5.2
[43] pkgconfig_2.0.3      htmltools_0.5.9