We are dealing with the same data set as in session 1; as a reminder, the data are in _input/genitives.csv and you can find information about the variables/columns in _input/genitives.r.
rm(list=ls(all.names=TRUE))
library(magrittr); library(randomForest); library(pdp)
summary(d <- read.delim(        # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE))      # change categorical variables into factors
CASE GENITIVE SPEAKER MODALITY POR_LENGTH
Min. : 2 of:2720 nns:2666 spoken :1685 Min. : 1.00
1st Qu.:1006 s : 880 ns : 934 written:1915 1st Qu.: 8.00
Median :2018 Median : 11.00
Mean :2012 Mean : 14.58
3rd Qu.:3017 3rd Qu.: 17.00
Max. :4040 Max. :204.00
PUM_LENGTH POR_ANIMACY POR_FINAL_SIB POR_DEF
Min. : 2.00 animate : 920 absent :2721 definite :2349
1st Qu.: 6.00 collective: 607 present: 879 indefinite:1251
Median : 9.00 inanimate :1671
Mean : 10.35 locative : 243
3rd Qu.: 13.00 temporal : 159
Max. :109.00
We are again asking, does the choice of a genitive construction (of vs. s) vary as a function of
all the predictors that are already part of the data frame, i.e.
the categorical predictors SPEAKER, MODALITY, POR_ANIMACY, POR_FINAL_SIB, POR_DEF;
the numeric predictors POR_LENGTH and PUM_LENGTH;
an additional new length-based predictor, namely how the length of the possessor POR_LENGTH compares to the length of the possessum (PUM_LENGTH) (expressed as a difference); since such a comparison variable doesn’t exist yet in our data set, we need to create it first.
However, this session, we will deal with this using random forests.
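For orientation: the new predictor (called LEN_PORmPUM_LOG in the code below) is operationalized as a difference of binary logs, so it is positive when the possessor is longer and negative when the possessum is longer. A minimal language-agnostic sketch of that idea (in Python; the function name is just for illustration):

```python
import math

def len_por_m_pum_log(por_len, pum_len):
    # logged length difference: positive = possessor longer, negative = possessum longer
    return math.log2(por_len) - math.log2(pum_len)

print(len_por_m_pum_log(8, 4))  # possessor twice as long -> 1.0
print(len_por_m_pum_log(4, 8))  # possessum twice as long -> -1.0
print(len_por_m_pum_log(5, 5))  # equal lengths -> 0.0
```

Using logs rather than raw differences means that doubling either length changes the score by the same fixed amount, regardless of absolute length.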
1.1 Exploration & preparation
Some exploration of the relevant variables:
table(d$GENITIVE, d$SPEAKER) %>% addmargins
nns ns Sum
of 2024 696 2720
s 642 238 880
Sum 2666 934 3600
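A quick aside: the row totals of this table (2720 of vs. 880 s, identical in all the tables below) already determine the no-information rate used as a baseline later, i.e. the accuracy of always guessing the majority class. Just as an arithmetic check (Python):

```python
counts = {"of": 2720, "s": 880}  # GENITIVE row totals from the table above
n = sum(counts.values())         # 3600 genitives in all
nir = max(counts.values()) / n   # no-information rate: always guess the majority class
print(round(nir, 4))             # 0.7556
```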
table(d$GENITIVE, d$MODALITY) %>% addmargins
spoken written Sum
of 1232 1488 2720
s 453 427 880
Sum 1685 1915 3600
table(d$GENITIVE, d$POR_ANIMACY) %>% addmargins
animate collective inanimate locative temporal Sum
of 370 408 1638 199 105 2720
s 550 199 33 44 54 880
Sum 920 607 1671 243 159 3600
table(d$GENITIVE, d$POR_FINAL_SIB) %>% addmargins
absent present Sum
of 1962 758 2720
s 759 121 880
Sum 2721 879 3600
table(d$GENITIVE, d$POR_DEF) %>% addmargins
definite indefinite Sum
of 1609 1111 2720
s 740 140 880
Sum 2349 1251 3600
hist(d$LEN_PORmPUM_LOG <- "-"( # create the logged length difference & plot it
   log2(d$POR_LENGTH),         # log2 possessor length minus ...
   log2(d$PUM_LENGTH)),        # ... log2 possessum length
   main="", xlab="LEN_PORmPUM_LOG")
1.2 Deviance & baseline(s)
Let's already compute all baseline values for what will be the response variable, GENITIVE:
(nulls <- setNames(c(
   d$GENITIVE %>% table %>% prop.table %>% max,
   d$GENITIVE %>% table %>% prop.table %>% "^"(2) %>% sum,
   deviance(glm(GENITIVE ~ 1, family=binomial, data=d)),
   logLik(glm(GENITIVE ~ 1, family=binomial, data=d))),
   c("no-info rate", "proportional guessing", "null deviance", "null logLik")))
1.3 Random forests
1.3.1 Creating & evaluating a random forest
How about we fit a 'regular' random forest?
set.seed(sum(utf8ToInt("Rotzlöffel")))
(rf_1 <- randomForest(GENITIVE ~
   # categorical predictors
   SPEAKER + MODALITY + POR_ANIMACY + POR_FINAL_SIB + POR_DEF +
   # numeric predictors
   POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG,
   data=d,           # the data are in d
   ntree=1500,       # how many trees to fit/how many data sets to ...
   replace=TRUE,     # ... sample w/ replacement
   mtry=3,           # how many variables are eligible at each split
   keep.forest=TRUE, # retain the forest
   keep.inbag=TRUE,  # retain which points were not OOB
   importance=TRUE)) # compute importance scores
Call:
randomForest(formula = GENITIVE ~ SPEAKER + MODALITY + POR_ANIMACY + POR_FINAL_SIB + POR_DEF + POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG, data = d, ntree = 1500, replace = TRUE, mtry = 3, keep.forest = TRUE, keep.inbag = TRUE, importance = TRUE)
Type of random forest: classification
Number of trees: 1500
No. of variables tried at each split: 3
OOB estimate of error rate: 14.08%
Confusion matrix:
of s class.error
of 2480 240 0.08823529
s 267 613 0.30340909
How well does the forest do? Let’s compute numeric predictions (which we restrict to the second level of the response like in binary regression modeling and like in trees before) and categorical predictions:
d$PRED_PP_2 <- predict( # make d$PRED_PP_2 the result of predicting
   rf_1,                # from rf_1
   type="prob")[,"s"]   # predicted probabilities of "s"
d$PRED_CAT <- predict(rf_1) # categorical predictions
(c_m <- table(          # confusion matrix: cross-tabulate
   OBS  =d$GENITIVE,    # observed genitives in the rows
   PREDS=d$PRED_CAT))   # predicted genitives in the columns
PREDS
OBS of s
of 2480 240
s 267 613
Let’s evaluate this confusion matrix with accuracy, precision(s), and recall(s):
c( # precisions & accuracies/recalls
   "Class. acc."     =mean(d$GENITIVE==d$PRED_CAT, na.rm=TRUE),
   "Prec. for s"     =c_m["s","s"]   / sum(c_m[ ,"s"]),
   "Acc./rec. for s" =c_m["s","s"]   / sum(c_m["s",]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[ ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",]))
Class. acc. Prec. for s Acc./rec. for s Prec. for of
0.8591667 0.7186401 0.6965909 0.9028031
Acc./rec. for of
0.9117647
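To make the formulas transparent, all five values can be recomputed by hand from the confusion matrix above; a quick arithmetic check (Python, with the counts copied from the matrix):

```python
# confusion matrix from above: rows = observed, columns = predicted
cm = {("of","of"): 2480, ("of","s"): 240,
      ("s","of"):  267,  ("s","s"): 613}
n = sum(cm.values())                               # 3600

acc     = (cm[("of","of")] + cm[("s","s")]) / n    # overall classification accuracy
prec_s  = cm[("s","s")]   / (cm[("of","s")]  + cm[("s","s")])   # column sum for s
rec_s   = cm[("s","s")]   / (cm[("s","of")]  + cm[("s","s")])   # row sum for s
prec_of = cm[("of","of")] / (cm[("of","of")] + cm[("s","of")])  # column sum for of
rec_of  = cm[("of","of")] / (cm[("of","of")] + cm[("of","s")])  # row sum for of

print(round(acc, 7))      # 0.8591667
print(round(prec_s, 7))   # 0.7186401
print(round(rec_s, 7))    # 0.6965909
print(round(prec_of, 7))  # 0.9028031
print(round(rec_of, 7))   # 0.9117647
```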
Not bad … But what about an R-squared? Usually, people do not report R-squareds for trees or forests, but I think those can be heuristically useful. Thus, let's add a column for the deviance calculation:
d$PREDS_PP_OBS <- abs("-"( # the absolute value of the difference
   d$PRED_PP_2,            # the predicted prob. of the 2nd level
   d$GENITIVE!="s"))       # 0 when obs=2nd level, 1 when obs=1st level
d$CONTRIBS2DEV <- -log(d$PREDS_PP_OBS)
So, what is the deviance of the forest?
2*sum(d$CONTRIBS2DEV)
[1] Inf
Oops, why is that? Check summary(d$PREDS_PP_OBS) for the answer.
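The short answer: whenever all trees agree, the predicted probability of the observed outcome can be exactly 0, and -log(0) is infinite, so the deviance sum is too. A small Python sketch with a hypothetical probability vector illustrates both the problem and the half-smallest-gap remedy:

```python
import math

p_obs = [0.9, 0.0, 1.0, 0.4, 0.75]  # hypothetical predicted probs of the observed class

# -log of a 0 probability is infinite, so the raw deviance blows up:
raw = 2 * sum(-math.log(p) if p > 0 else math.inf for p in p_obs)
print(math.isinf(raw))               # True

# remedy: half the smallest gap between distinct predicted probabilities ...
u = sorted(set(p_obs))
offset = min(b - a for a, b in zip(u, u[1:])) / 2
# ... replaces exact 0s and 1s:
p_fixed = [offset if p == 0 else 1 - offset if p == 1 else p for p in p_obs]
deviance = 2 * sum(-math.log(p) for p in p_fixed)
print(math.isinf(deviance))          # False
```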
How do we deal with this? By adding to and subtracting from the predicted probabilities of 0 and 1 respectively half of the smallest difference between predicted probabilities:
offset <- unique(d$PREDS_PP_OBS) %>% sort %>% diff %>% min %>% "/"(2)
d$PREDS_PP_OBS[d$PREDS_PP_OBS==0] <- offset
d$PREDS_PP_OBS[d$PREDS_PP_OBS==1] <- 1-offset
Now this should work ...
d$CONTRIBS2DEV <- -log(d$PREDS_PP_OBS)
2*sum(d$CONTRIBS2DEV)
Thus, we can compute McFadden's R-squared:
(nulls[3] - 2*sum(d$CONTRIBS2DEV)) / nulls[3]
1.3.2 Interpreting the random forest
Which variables are most important for the predictions?
varimps <- importance(rf_1)
plot(pch=16, col="#00000040",
   xlab="Mean decr accuracy", xlim=c(0, 350), x=varimps[,3],
   ylab="Mean decr Gini"    , ylim=c(0, 450), y=varimps[,4])
grid(); lines(par("usr")[1:2], par("usr")[3:4], lty=3)
text(varimps[,3], varimps[,4], rownames(varimps))
Figure 1: Variable importance scores for the random forest
And what are some predictors’ (directions of) effects? We check the most powerful categorical predictor POR_ANIMACY with the default plot for partial dependency scores in Figure 2:
(pd_a <- partial(          # make pd_a contain partial dependence scores
   object=rf_1,            # from this forest
   pred.var="POR_ANIMACY", # for this predictor
   prob=TRUE,              # return predicted probabilities
   which.class=2,          # for the 2nd level of the response
   train=d))               # these were the training data
plot(pd_a, ylab="How much a level pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
text(2.5, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3)    # baseline
text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 2: Partial dependency scores for POR_ANIMACY (plot 1)
… and a maybe more informative version in Figure 3:
tab_a <- table(POR_ANIMACY=d$POR_ANIMACY) # determine the frequencies of the animacy levels
barplot(main="Partial dep. of GENITIVE on POR_ANIMACY", # make a bar plot w/ this heading
   col="darkgrey",    # grey bars
   height=pd_a$yhat,  # whose heights are the PDP scores
   xlab="Animacy of possessor",                 # x-axis label
   ylab="Partial dependence score (for s-gen)", # y-axis label
   ylim=c(0, 1),      # y-axis limits
   # look up ?abbreviate, which I use to make the names fit:
   names.arg=abbreviate(pd_a$POR_ANIMACY, 4), # label the bars like this
   # make the widths of the bars represent the proportions of POR_ANIMACY
   width=prop.table(tab_a))
abline(h=0.5, lty=2); grid() # cut-off point for decision
text(0.6, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3)    # baseline
text(0.8, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 3: Partial dependency scores for POR_ANIMACY (plot 2)
Then we do the same for one numeric predictor, LEN_PORmPUM_LOG: a default version …
pd_l <- partial(       # make pd_l contain partial dependence scores
   object=rf_1,        # from this forest
   pred.var="LEN_PORmPUM_LOG", # for this predictor
   grid.resolution=10, # provide estimates for this many predictor values
   prob=TRUE,          # return predicted probabilities
   which.class=2,      # for the 2nd level of the response
   train=d)            # these were the training data
pd_l # here's what we just created
plot(pd_l, ylab="How much a value pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
text(2.5, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3)    # baseline
text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 4: Partial dependency scores for LEN_PORmPUM_LOG (plot 1)
… and a better one:
pd_l$FREQ <- table(cut(d$LEN_PORmPUM_LOG,
   breaks=c(-5, (pd_l$LEN_PORmPUM_LOG[1:9] + pd_l$LEN_PORmPUM_LOG[2:10]) / 2, 5),
   include.lowest=TRUE))
pd_l$PROP <- prop.table(pd_l$FREQ)
plot(main="Partial dep. of GENITIVE on LEN_PORmPUM_LOG", # make a plot w/ this heading
   type="b", pch=16,        # lines & points (filled circles)
   xlab="Possessor minus possessum length", # x-axis label
   ylab="Partial dep. score (for s-gen)",   # y-axis label
   ylim=c(0, 1),            # y-axis limits
   x=pd_l$LEN_PORmPUM_LOG,  # x-coords: the length differences
   y=pd_l$yhat,             # y-coords: PDP scores
   cex=0.5+pd_l$PROP*10)    # make the point sizes represent the frequencies of the length differences
abline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0.1, 0.9, 0.1)), col="grey", lty=3) # x-axis decile grid
abline(h=seq(0, -3, -0.5), lty=3, col="#00000020") # y-axis grid
lines(lowess(pd_l$yhat ~ pd_l$LEN_PORmPUM_LOG), lwd=6, col="#BCBCBC80") # add a smoother
abline(h=0.5, lty=2)      # cut-off point for decision
text(2, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3) # baseline
text(4, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 5: Partial dependency scores for LEN_PORmPUM_LOG (plot 2)
Test question for you: why might this smoother be potentially problematic (potentially, here it’s fine) and how might one fix that?
Of course, one might ideally also check for interactions …
1.3.3 Write-up
To determine whether the choice of a genitive construction (of vs. s) varies as a function of
the categorical predictors SPEAKER, MODALITY, POR_ANIMACY, POR_FINAL_SIB, POR_DEF;
the numeric predictors POR_LENGTH and PUM_LENGTH and the logged length difference between them (LEN_PORmPUM_LOG),
a random forest was fit with the following hyperparameters: ntree was set to 1500 and mtry to 3. The results indicate that the genitive alternation can be predicted fairly well: while the no-information rate is 75.56% (for of), the overall accuracy of the forest is 85.9%, with separate precision and recall values for each genitive as follows: [quote precision and recall values]. The forest also came with a reduction in the deviance of the response corresponding to a McFadden's R-squared of 0.302.
Variable importance scores (the combination of mean decrease of Gini and mean decrease of accuracy) indicate that two main kinds of predictors play the most important roles: the animacy of the possessor and predictors related to the lengths of the possessor and the possessum; of the latter, the logged length difference was most important. [show plot] The effect of POR_ANIMACY is that
there is a very strong preference for of-genitives with inanimate possessors;
there are weak preferences for of-genitives with collective, locative, and temporal possessors;
there is a weak preference for s-genitives with animate possessors.
The effect of the length difference variable boils down to a simple (and expected) short-before-long effect.
1.4 Conditional inference forests
Let’s also see whether a conditional inference forest leads to different results (not very likely, but possible, esp. with regard to variable importances):
d <- d[,1:10] # restore d to what we loaded
detach(package:randomForest); library(partykit)
1.4.1 Creating & evaluating a conditional inference forest
Here’s how we fit a conditional inference forest; in general the syntax is very similar, but there are a few tiny changes:
set.seed(sum(utf8ToInt("Gummibären"))) # set a replicable random number seed
cf_1 <- cforest(GENITIVE ~
   # categorical predictors
   SPEAKER + MODALITY + POR_ANIMACY + POR_FINAL_SIB + POR_DEF +
   # numeric predictors
   POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG,
   data=d,
   ntree=1500,                 # how many trees to fit/how many data sets to ...
   perturb=list(replace=TRUE), # ... sample w/ replacement
   mtry=3)                     # how many variables are eligible at each split
We’ll proceed exactly as before: we first compute numeric and categorical predictions, then we evaluate the resulting confusion matrix:
d$PRED_PP_2 <- predict( # make d$PRED_PP_2 the result of predicting
   cf_1,                # from cf_1
   type="prob")[,"s"]   # predicted probabilities of "s"
d$PRED_CAT <- predict(cf_1) # categorical predictions
(c_m <- table(          # confusion matrix: cross-tabulate
   OBS  =d$GENITIVE,    # observed genitives in the rows
   PREDS=d$PRED_CAT))   # predicted genitives in the columns
PREDS
OBS of s
of 2480 240
s 267 613
c( # precisions & accuracies/recalls
   "Class. acc."     =mean(d$GENITIVE==d$PRED_CAT, na.rm=TRUE),
   "Prec. for s"     =c_m["s","s"]   / sum(c_m[ ,"s"]),
   "Acc./rec. for s" =c_m["s","s"]   / sum(c_m["s",]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[ ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",]))
Class. acc. Prec. for s Acc./rec. for s Prec. for of
0.8591667 0.7186401 0.6965909 0.9028031
Acc./rec. for of
0.9117647
d$PREDS_PP_OBS <- abs("-"( # the absolute value of the difference
   d$PRED_PP_2,            # the predicted prob. of the 2nd level
   d$GENITIVE!="s"))       # 0 when obs=2nd level, 1 when obs=1st level
# calculate the offset and ...
offset <- unique(d$PREDS_PP_OBS) %>% sort %>% diff %>% min %>% "/"(2)
# ... apply it to the extreme values:
d$PREDS_PP_OBS[d$PREDS_PP_OBS==0] <- offset
d$PREDS_PP_OBS[d$PREDS_PP_OBS==1] <- 1-offset
# compute contributions to deviance and ...
d$CONTRIBS2DEV <- -log(d$PREDS_PP_OBS)
# ... the deviance:
2*sum(d$CONTRIBS2DEV)
[1] 1693.867
# which leads to this McFadden R-squared:
(nulls[3] - 2*sum(d$CONTRIBS2DEV)) / nulls[3]
null deviance
0.5769852
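As a plausibility check on this value: McFadden's R-squared is 1 minus the ratio of the forest's deviance to the null deviance; plugging in the forest deviance from above and the null deviance (≈ 4004.27, the nulls[3] value, a figure taken from the session's source code) recovers the reported number. In Python, plain arithmetic:

```python
null_dev  = 4004.273  # null deviance of GENITIVE (nulls[3] in the session code, rounded)
model_dev = 1693.867  # deviance of the conditional inference forest (above)
mcfadden  = (null_dev - model_dev) / null_dev
print(round(mcfadden, 4))  # 0.577
```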
# save because partykit is slow
save(d, cf_1, file="_output/204_09_cf1.rds")
1.4.2 Interpreting the conditional inference forest
Which variables are most important for the predictions?
(varimps_reg <- sort(varimp(cf_1, cores=10))) # on non-Windoze, delete "cores=10"
The standard variable importance scores are very similar to those we saw from randomForest: POR_ANIMACY and length-based predictors led by LEN_PORmPUM_LOG are the strongest predictors. The conditional variable importance scores cannot be computed, at least not on hardware I have available (a fast AMD Ryzen 9 3900 with 24 threads and 64 GB of RAM; after more than 12 hours, the function was still running, using 80 GB of RAM/swap memory).
What are these two predictors’ (directions of) effects? Most likely, exactly the same as before. Here’s the result for the categorical predictor POR_ANIMACY with the default plot …
(pd_a <- partial(          # make pd_a contain partial dependence scores
   object=cf_1,            # from this forest
   pred.var="POR_ANIMACY", # for this predictor
   prob=TRUE,              # return predicted probabilities
   which.class=2,          # for the 2nd level of the response
   train=d))               # these were the training data
plot(pd_a, ylab="How much a level pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
text(2.5, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3)    # baseline
text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 6: Partial dependency scores for POR_ANIMACY (plot 1)
… and a more informative version:
tab_a <- table(POR_ANIMACY=d$POR_ANIMACY) # determine the frequencies of the animacy levels
barplot(main="Partial dep. of GENITIVE on POR_ANIMACY", # make a bar plot w/ this heading
   col="darkgrey",    # grey bars
   height=pd_a$yhat,  # whose heights are the PDP scores
   xlab="Animacy of possessor",                 # x-axis label
   ylab="Partial dependence score (for s-gen)", # y-axis label
   ylim=c(0, 1),      # y-axis limits
   # look up ?abbreviate, which I use to make the names fit:
   names.arg=abbreviate(pd_a$POR_ANIMACY, 4), # label the bars like this
   # make the widths of the bars represent the proportions of POR_ANIMACY
   width=prop.table(tab_a))
abline(h=0.5, lty=2); grid() # cut-off point for decision
text(0.6, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3)    # baseline
text(0.8, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 7: Partial dependency scores for POR_ANIMACY (plot 2)
On the whole, these don’t look very different from the ones generated with the forest from randomForest.
Now we do the same for one numeric predictor, LEN_PORmPUM_LOG: a default version …
pd_l <- partial(       # make pd_l contain partial dependence scores
   object=cf_1,        # from this forest
   pred.var="LEN_PORmPUM_LOG", # for this predictor
   grid.resolution=10, # provide estimates for this many predictor values
   prob=TRUE,          # return predicted probabilities
   which.class=2,      # for the 2nd level of the response
   train=d)            # these were the training data
pd_l # here's what we just created
plot(pd_l, ylab="How much a value pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
text(2.5, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3)    # baseline
text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
… and a better one:
pd_l$FREQ <- table(cut(d$LEN_PORmPUM_LOG,
   breaks=c(-5, (pd_l$LEN_PORmPUM_LOG[1:9] + pd_l$LEN_PORmPUM_LOG[2:10]) / 2, 5),
   include.lowest=TRUE))
pd_l$PROP <- prop.table(pd_l$FREQ)
plot(main="Partial dep. of GENITIVE on LEN_PORmPUM_LOG", # make a plot w/ this heading
   type="b", pch=16,        # lines & points (filled circles)
   xlab="Possessor minus possessum length", # x-axis label
   ylab="Partial dep. score (for s-gen)",   # y-axis label
   ylim=c(0, 1),            # y-axis limits
   x=pd_l$LEN_PORmPUM_LOG,  # x-coords: the length differences
   y=pd_l$yhat,             # y-coords: PDP scores
   cex=0.5+pd_l$PROP*10)    # make the point sizes represent the frequencies of the length differences
abline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0.1, 0.9, 0.1)), col="grey", lty=3) # x-axis decile grid
abline(h=seq(0, -3, -0.5), lty=3, col="#00000020") # y-axis grid
lines(lowess(pd_l$yhat ~ pd_l$LEN_PORmPUM_LOG), lwd=6, col="#BCBCBC80") # add a smoother
abline(h=0.5, lty=2)      # cut-off point for decision
text(2, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3) # baseline
text(4, nulls[1], "baseline NIR", cex=0.75, pos=3)
Figure 8: Partial dependency scores for LEN_PORmPUM_LOG (plot 2)
Same thing: not much of a difference to the previous forest, which is why I am skipping the write-up section. (Also, this smoother suffers from the same theoretical problem as the previous one, even though for these data fixing it again makes no difference.)
2 Session info
# save again because partykit is slow
save(d, cf_1, file="_output/204_09_cf1.rds")
sessionInfo()
---title: "Ling 204: session 09"author: - name: "[Stefan Th. Gries](https://www.stgries.info)" affiliation: - UC Santa Barbara - JLU Giessen orcid: 0000-0002-6497-3958date: "2026-03-04 12:34:56 PDT"date-format: "DD MMM YYYY HH-mm-ss"editor: sourceformat: html: page-layout: full code-fold: false code-link: true code-copy: true code-tools: true code-line-numbers: true code-overflow: scroll number-sections: true smooth-scroll: true toc: true toc-depth: 4 number-depth: 4 toc-location: left monofont: lucida console tbl-cap-location: top fig-cap-location: bottom fig-width: 4 fig-height: 4 fig-format: png fig-dpi: 300 fig-align: center embed-resources: true link-external-newwindow: trueexecute: cache: false echo: true eval: true warning: false---# Session 09: Random forests {#sec-s09}We are dealing with the same data set as in session 1; as a reminder, the data are in [_input/genitives.csv](_input/genitives.csv) and you can find information about the variables/columns in [_input/genitives.r](_input/genitives.r).```{r}#| message: falserm(list=ls(all.names=TRUE))library(magrittr); library(randomForest); library(pdp)summary(d <-read.delim( # summarize d, the result of loadingfile="_input/genitives.csv", # this filestringsAsFactors=TRUE)) # change categorical variables into factors```We are again asking, does the choice of a genitive construction (*of* vs. *s*) vary as a function of* all the predictors that are already part of the data frame, i.e. 
+ the categorical predictors `SPEAKER`, `MODALITY`, `POR_ANIMACY`, `POR_FINAL_SIB`, `POR_DEF`; + the numeric predictors `POR_LENGTH` and `PUM_LENGTH`;* an additional new length-based predictor, namely how the length of the possessor `POR_LENGTH` compares to the length of the possessum (`PUM_LENGTH`) (expressed as a difference); since such a comparison variable doesn't exist yet in our data set, we need to create it first.However, this session, we will deal with this using random forests.## Exploration & preparationSome exploration of the relevant variables:```{r}table(d$GENITIVE, d$SPEAKER) %>% addmarginstable(d$GENITIVE, d$MODALITY) %>% addmarginstable(d$GENITIVE, d$POR_ANIMACY) %>% addmarginstable(d$GENITIVE, d$POR_FINAL_SIB) %>% addmarginstable(d$GENITIVE, d$POR_DEF) %>% addmarginshist(d$LEN_PORmPUM_LOG <-"-"(log2(d$POR_LENGTH),log2(d$PUM_LENGTH)),main="", xlab="LEN_PORmPUM_LOG")```## Deviance & baseline(s)Let's already compute all baseline value for what will be the response variable, `GENITIVE`:```{r}(nulls <-setNames(c(d$GENITIVE %>% table %>% prop.table %>% max, d$GENITIVE %>% table %>% prop.table %>%"^"(2) %>% sum,deviance(glm(GENITIVE ~1, family=binomial, data=d)),logLik(glm(GENITIVE ~1, family=binomial, data=d))),c("no-info rate", "proportional guessing", "null deviance", "null logLik")))```## Random forests### Creating & evaluating a random forestHow about we fit a 'regular' random forest?```{r}set.seed(sum(utf8ToInt("Rotzlöffel")))(rf_1 <-randomForest(GENITIVE ~# categorical predictors SPEAKER + MODALITY + POR_ANIMACY + POR_FINAL_SIB + POR_DEF +# numeric predictors POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG,data=d, # the data are in dntree=1500, # how many trees to fit/how many data sets to ...replace=TRUE, # ... sample w/ replacementmtry=3, # how many variables are eligible at each splitkeep.forest=TRUE, # retain the forestkeep.inbag=TRUE, # retain which points were not OOBimportance=TRUE)) # compute importance scores```How well does the forest do? 
Let's compute numeric predictions (which we restrict to the second level of the response like in binary regression modeling and like in trees before) and categorical predictions:```{r}d$PRED_PP_2 <-predict( # make d$PRED_PP_2 the result of predicting rf_1, # from rf_1type="prob")[,"s"] # predicted probabilities of "s"d$PRED_CAT <-predict(rf_1) # categorical predictions(c_m <-table( # confusion matrix: cross-tabulateOBS =d$GENITIVE, # observed orders in the rowsPREDS=d$PRED_CAT)) # predicted orders in the columns```Let's evaluate this confusion matrix with accuracy, precision(s), and recall(s):```{r}c( # precisions & accuracies/recalls"Class. acc."=mean(d$GENITIVE==d$PRED_CAT, na.rm=TRUE),"Prec. for s"=c_m["s","s"] /sum(c_m[ ,"s"]),"Acc./rec. for s"=c_m["s","s"] /sum(c_m["s",]),"Prec. for of"=c_m["of","of"] /sum(c_m[ ,"of"]),"Acc./rec. for of"=c_m["of","of"] /sum(c_m["of",]))```We also compute Cohen's κ and the *C*-score:```{r}c("Cohen's kappa"=cohens.kappa(c_m)[[1]],"C"=C.score(cv.f=d$GENITIVE, d$PRED_PP_2))```Not bad ... But what about an R-squared? Usually, people do not report R-squareds for trees or forests, but I think hose can be heuristically useful Thus, let's add a column for the deviance calculation:```{r}d$PREDS_PP_OBS <-abs("-"( # the absolute value of the difference d$PRED_PP_2, # the predicted prob. of the 2nd level d$GENITIVE!="s")) # 0 when obs=2nd level, 1 when obs=1st leveld$CONTRIBS2DEV <--log(d$PREDS_PP_OBS)```So, what is the deviance of the forest?```{r}2*sum(d$CONTRIBS2DEV)```Oops, why is that? Check `summary(d$PREDS_PP_OBS)` for the answer.How do we deal with this? 
By adding to and subtracting from the predicted probabilities of 0 and 1 respectively half of the smallest difference between predicted probabilities:```{r}offset <-unique(d$PREDS_PP_OBS) %>% sort %>% diff %>% min %>%"/"(2)d$PREDS_PP_OBS[d$PREDS_PP_OBS==0] <- offsetd$PREDS_PP_OBS[d$PREDS_PP_OBS==1] <-1-offset# d$PREDS_PP_OBS <- min2max(d$PREDS_PP_OBS, minim=offset, maxim=1-offset)```Now this should work ...```{r}d$CONTRIBS2DEV <--log(d$PREDS_PP_OBS)2*sum(d$CONTRIBS2DEV)```Thus, we can compute McFadden's R-squared:```{r}(nulls[3] -2*sum(d$CONTRIBS2DEV)) / nulls[3]```I leave it up to you to also compute Nagelkerke's R-squared from the null model's/forest's log likelihood and this forests's log likelihood.```{r}#| echo: false#| eval: false"/"(1-exp( (2793.659-4004.273) /sum(c_m)),1-exp( -4004.273/sum(c_m)))```### Interpreting the random forestWhich variables are most important for the predictions?```{r}#| label: fig-randfor_varimp#| fig-cap: Variable importance scores for the random forest#| fig-width: 5#| fig-height: 5varimps <-importance(rf_1)plot(pch=16, col="#00000040",xlab="Mean decr accuracy", xlim=c(0, 350), x=varimps[,3],ylab="Mean decr Gini" , ylim=c(0, 450), y=varimps[,4])grid(); lines(par("usr")[1:2], par("usr")[3:4], lty=3)text(varimps[,3], varimps[,4], rownames(varimps))```And what are some predictors' (directions of) effects? 
We check the most powerful categorical predictor `POR_ANIMACY` with the default plot for partial dependency scores in @fig-randfor_pdp_poranim1:```{r}#| label: fig-randfor_pdp_poranim1#| fig-cap: Partial dependency scores for POR_ANIMACY (plot 1)(pd_a <-partial( # make pd_a contain partial dependence scoresobject=rf_1, # from this forestpred.var="POR_ANIMACY", # for this predictorprob=TRUE, # return predicted probabilitieswhich.class=2, # for the 2nd level of the responsetrain=d)) # these were the training dataplot(pd_a, ylab="How much a level pushes s-genitives", ylim=c(0, 1))abline(h=0.5, lty=2); grid() # cut-off point for decisiontext(2.5, 0.5, "cut-off" , cex=0.75, pos=1)abline(h=nulls[1], lty=3); grid() # baselinetext(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)```... and a maybe more informative version in @fig-randfor_pdp_poranim2:```{r}#| label: fig-randfor_pdp_poranim2#| fig-cap: Partial dependency scores for POR_ANIMACY (plot 2)#| fig-width: 5#| fig-height: 5tab_a <-table(POR_ANIMACY=d$POR_ANIMACY) # determine the frequencies of the conjunctionsbarplot(main="Partial dep. 
of GENITIVE on POR_ANIMACY", # make a bar plot w/ this headingcol="darkgrey", # grey barsheight=pd_a$yhat, # whose heights are the PDP scoresxlab="Animacy of possessor", # x-axis labelylab="Partial dependence score (for s-gen)", # y-axis labelylim=c(0, 1), # y-axis limits# look up ?abbreviate, which I use to make the names fit:names.arg=abbreviate(pd_a$POR_ANIMACY, 4), # label the bars like this# make the widths of the bars represent the proportions of POR_ANIMACYwidth=prop.table(tab_a))abline(h=0.5, lty=2); grid() # cut-off point for decisiontext(0.6, 0.5, "cut-off" , cex=0.75, pos=1)abline(h=nulls[1], lty=3); grid() # baselinetext(0.8, nulls[1], "baseline NIR", cex=0.75, pos=3)```Then we do the same for one numeric predictor, `LEN_PORmPUM_LOG`: a default version ...```{r}#| label: fig-randfor_pdp_lendiff1#| fig-cap: Partial dependency scores for LEN_PORmPUM_LOG (plot 1)pd_l <-partial( # make pd_l contain partial dependence scoresobject=rf_1, # from this forestpred.var="LEN_PORmPUM_LOG", # for this predictorgrid.resolution=10, # provide estimates for this many predictor valuesprob=TRUE, # return predicted probabilitieswhich.class=2, # for the 2nd level of the responsetrain=d) # these were the training datapd_l # here's what we just createdplot(pd_l, ylab="How much a value pushes s-genitives", ylim=c(0, 1))abline(h=0.5, lty=2); grid() # cut-off point for decisiontext(2.5, 0.5, "cut-off" , cex=0.75, pos=1)abline(h=nulls[1], lty=3); grid() # baselinetext(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)```... and a better one:```{r}#| label: fig-randfor_pdp_lendiff2#| fig-cap: Partial dependency scores for LEN_PORmPUM_LOG (plot 2)#| fig-width: 5#| fig-height: 5pd_l$FREQ <-table(cut(d$LEN_PORmPUM_LOG,breaks=c(-5, (pd_l$LEN_PORmPUM_LOG[1:9] + pd_l$LEN_PORmPUM_LOG[2:10]) /2, 5),include.lowest=TRUE))pd_l$PROP <-prop.table(pd_l$FREQ)plot(main="Partial dep. 
of GENITIVE on LEN_PORmPUM_LOG", # make a plot w/ this headingtype="b", pch=16, # lines & points (filled circles)xlab="Possessor minus possessum length", # x-axis labelylab="Partial dep. score (for s-gen)", # y-axis labelylim=c(0, 1), # y-axis limitsx=pd_l$LEN_PORmPUM_LOG, # x-coords: the length differencesy=pd_l$yhat, # y-coords: PDP scorescex=0.5+pd_l$PROP*10) # make the point sizes represent the frequencies of the length differencesabline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0.1, 0.9, 0.1)), col="grey", lty=3) # x-axis decile gridabline(h=seq(0, -3, -0.5), lty=3, col="#00000020") # y-axis gridlines(lowess(pd_l$yhat ~ pd_l$LEN_PORmPUM_LOG), lwd=6, col="#BCBCBC80") # add a smootherabline(h=0.5, lty=2); grid() # cut-off point for decisiontext(2, 0.5, "cut-off" , cex=0.75, pos=1)abline(h=nulls[1], lty=3); grid() # baselinetext(4, nulls[1], "baseline NIR", cex=0.75, pos=3)```Test question for you: why might this smoother be potentially problematic (*potentially*, here it's fine) and how might one fix that?```{r}#| echo: false#| eval: false# The smoother weights all predictions equally: it doesn't take into consideration that the middle *x*-axis values are much more frequent than the more extreme ones on the outsides. This would probably be better:lines(lowess(rep(pd_l$yhat, pd_l$FREQ) ~rep(pd_l$LEN_PORmPUM_LOG, pd_l$FREQ)),lwd=6, col="red")```Of course, one might ideally also check for interactions ...### Write-upTo determine whether the choice of a genitive construction (*of* vs. *s*) varies as a function of* the categorical predictors `SPEAKER`, `MODALITY`, `POR_ANIMACY`, `POR_FINAL_SIB`, `POR_DEF`;* the numeric predictors `POR_LENGTH` and `PUM_LENGTH` and the logged length difference between them (`LEN_PORmPUM_LOG`),a random forest was fit with the following hyperparameters: `ntree` was set to 1500 and `mtry` to 3. 
The results indicate that the genitive alternation is predictable fairly well: While the no-information rate is 75.56% (for *of*), the overall accuracy of the forest is 85.9%, with separate precision and recall values for each genitive as follows: [quote precision and recall values]. The forest came with a reduction in deviance of the response of (a McFadden's *R*^2^ of) 0.302.Variable importance scores (the combination of mean decrease of Gini and of accuracy) indicate that two main kinds of predictors play the most important roles: the animacy of the possessor and values related to the lengths of the possessor and the possessed; of the latter, the logged lengths difference was most important. [show plot] The effect of `POR_ANIMACY` is that* there is a very strong preference for *of*-genitives with inanimate possessors;* there are weak preferences for *of*-genitives with collective, locative, and temporal possessors;* there is a weak preference for *s*-genitives with animate possessors.The effect of the length difference variable boils down to a simple (and expected) short-before-long effect.## Conditional inference forestsLet's also see whether a conditional inference forests leads to different results (not very likely, but possible, esp. with regard to variable importances):```{r}#| message: falsed <- d[,1:10] # restore d to what we loadeddetach(package:randomForest); library(partykit)```### Creating & evaluating a conditional inference forestHere's how we fit a conditional inference forest; in general the syntax is very similar, but there are a few tiny changes:```{r condinffor}set.seed(sum(utf8ToInt("Gummibären"))) # set a replicable random number seedcf_1 <- cforest(GENITIVE ~ # categorical predictors SPEAKER + MODALITY + POR_ANIMACY + POR_FINAL_SIB + POR_DEF + # numeric predictors POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG, data=d, ntree=1500, # how many trees to fit/how many data sets to ... perturb=list(replace=TRUE), # ... 
sample w/ replacement mtry=3) # how many variables are eligible at each split```We'll proceed exactly as before: we first compute numeric and categorical predictions, then we evaluate the resulting confusion matrix:```{r condinfforpreds}d$PRED_PP_2 <- predict( # make d$PRED_PP_2 the result of predicting cf_1, # from cf_1 type="prob")[,"s"] # predicted probabilities of "s"d$PRED_CAT <- predict(rf_1) # categorical predictions(c_m <- table( # confusion matrix: cross-tabulate OBS =d$GENITIVE, # observed orders in the rows PREDS=d$PRED_CAT)) # predicted orders in the columnsc( # precisions & accuracies/recalls "Class. acc."=mean(d$GENITIVE==d$PRED_CAT, na.rm=TRUE), "Prec. for s" =c_m["s","s"] / sum(c_m[ ,"s"]), "Acc./rec. for s" =c_m["s","s"] / sum(c_m["s",]), "Prec. for of" =c_m["of","of"] / sum(c_m[ ,"of"]), "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",]))c("Cohen's kappa"=cohens.kappa(c_m)[[1]], "C"=C.score(cv.f=d$GENITIVE, d$PRED_PP_2))d$PREDS_PP_OBS <- abs("-"( # the absolute value of the difference d$PRED_PP_2, # the predicted prob. 
of the 2nd level d$GENITIVE!="s")) # 0 when obs=2nd level, 1 when obs=1st level# calculate the offset and ...offset <- unique(d$PREDS_PP_OBS) %>% sort %>% diff %>% min %>% "/"(2)# apply it to the extreme values:d$PREDS_PP_OBS[d$PREDS_PP_OBS==0] <- offsetd$PREDS_PP_OBS[d$PREDS_PP_OBS==1] <- 1-offset# compute contributions to deviance and ...d$CONTRIBS2DEV <- -log(d$PREDS_PP_OBS)# the deviance2*sum(d$CONTRIBS2DEV)# which leads to this McFadden R-squared:(nulls[3] - 2*sum(d$CONTRIBS2DEV)) / nulls[3]# save because partykit is slowsave(d, cf_1, file="_output/204_09_cf1.rds")```### Interpreting the conditional inference forestWhich variables are most important for the predictions?```{r condinfforvarimp}(varimps_reg <- sort(varimp(cf_1, cores=10))) # on non-Windoze, delete "cores=10"# (varimps_cond <- sort(varimp(cf_1, cores=10, conditional=TRUE)))```The standard variable importance scores are very similar to those we saw from `randomForest`: `POR_ANIMACY` and length-based predictors led by `LEN_PORmPUM_LOG` are the strongest predictors. The conditional variable importance scores cannot be computed, at least not on hardware I have available (a fast AMD Ryzen 9 3900 with 24 threads and 64 GB of RAM; after more than 12 hours, the function was still running, using 80 GB of RAM/swap memory).What are these two predictors' (directions of) effects? Most most likely, exactly the same as before. 
Here's the result for the categorical predictor `POR_ANIMACY` with the default plot ...

```{r condinfforpartialporanim}
#| label: fig-condfor_pdp_poranim1
#| fig-cap: Partial dependency scores for POR_ANIMACY (plot 1)
(pd_a <- partial( # make pd_a contain partial dependence scores
   object=cf_1,            # from this forest
   pred.var="POR_ANIMACY", # for this predictor
   prob=TRUE,              # return predicted probabilities
   which.class=2,          # for the 2nd level of the response
   train=d))               # these were the training data
plot(pd_a, ylab="How much a level pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2.5, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
```

... and a more informative version:

```{r}
#| label: fig-condfor_pdp_poranim2
#| fig-cap: Partial dependency scores for POR_ANIMACY (plot 2)
#| fig-width: 5
#| fig-height: 5
tab_a <- table(POR_ANIMACY=d$POR_ANIMACY) # determine the frequencies of the levels of POR_ANIMACY
barplot(main="Partial dep. of GENITIVE on POR_ANIMACY", # make a bar plot w/ this heading
   col="darkgrey",   # grey bars
   height=pd_a$yhat, # whose heights are the PDP scores
   xlab="Animacy of possessor", # x-axis label
   ylab="Partial dependence score (for s-gen)", # y-axis label
   ylim=c(0, 1),     # y-axis limits
   # look up ?abbreviate, which I use to make the names fit:
   names.arg=abbreviate(pd_a$POR_ANIMACY, 4), # label the bars like this
   # make the widths of the bars represent the proportions of POR_ANIMACY
   width=prop.table(tab_a))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(0.6, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(0.8, nulls[1], "baseline NIR", cex=0.75, pos=3)
```

On the whole, these don't look very different from their counterparts generated with the forest from `randomForest`.

Now we do the same for one numeric predictor, `LEN_PORmPUM_LOG`: a default version ...

```{r condinfforpartiallendiff}
#| label: fig-condfor_pdp_lendiff1
#| fig-cap: Partial dependency scores for LEN_PORmPUM_LOG (plot 1)
pd_l <- partial( # make pd_l contain partial dependence scores
   object=cf_1,                # from this forest
   pred.var="LEN_PORmPUM_LOG", # for this predictor
   grid.resolution=10,         # provide estimates for this many predictor values
   prob=TRUE,                  # return predicted probabilities
   which.class=2,              # for the 2nd level of the response
   train=d)                    # these were the training data
pd_l # here's what we just created
plot(pd_l, ylab="How much a value pushes s-genitives", ylim=c(0, 1))
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2.5, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4.5, nulls[1], "baseline NIR", cex=0.75, pos=3)
```

...
and a better one:

```{r}
#| label: fig-condfor_pdp_lendiff2
#| fig-cap: Partial dependency scores for LEN_PORmPUM_LOG (plot 2)
#| fig-width: 5
#| fig-height: 5
pd_l$FREQ <- table(cut(d$LEN_PORmPUM_LOG,
   breaks=c(-5, (pd_l$LEN_PORmPUM_LOG[1:9] + pd_l$LEN_PORmPUM_LOG[2:10]) / 2, 5),
   include.lowest=TRUE))
pd_l$PROP <- prop.table(pd_l$FREQ)
plot(main="Partial dep. of GENITIVE on LEN_PORmPUM_LOG", # make a plot w/ this heading
   type="b", pch=16, # lines & points (filled circles)
   xlab="Possessor minus possessum length", # x-axis label
   ylab="Partial dep. score (for s-gen)",   # y-axis label
   ylim=c(0, 1),           # y-axis limits
   x=pd_l$LEN_PORmPUM_LOG, # x-coords: the length differences
   y=pd_l$yhat,            # y-coords: PDP scores
   cex=0.5+pd_l$PROP*10)   # make the point sizes represent the frequencies of the length differences
abline(v=quantile(d$LEN_PORmPUM_LOG, probs=seq(0.1, 0.9, 0.1)),
   col="grey", lty=3) # x-axis decile grid
abline(h=seq(0, 1, 0.25), lty=3, col="#00000020") # y-axis grid
lines(lowess(pd_l$yhat ~ pd_l$LEN_PORmPUM_LOG),
   lwd=6, col="#BCBCBC80") # add a smoother
abline(h=0.5, lty=2); grid() # cut-off point for decision
   text(2, 0.5, "cut-off", cex=0.75, pos=1)
abline(h=nulls[1], lty=3); grid() # baseline
   text(4, nulls[1], "baseline NIR", cex=0.75, pos=3)
```

Same thing: not much of a difference from the previous forest, which is why I am skipping the write-up section. (Also, this smoother suffers from the same theoretical problem as the previous one, even though for these data fixing it again makes no difference.)

```{r}
#| echo: false
#| eval: false
lines(lowess(rep(pd_l$yhat, pd_l$FREQ) ~ rep(pd_l$LEN_PORmPUM_LOG, pd_l$FREQ)),
   lwd=6, col="red")
```

# Session info

```{r}
# save again because partykit is slow
save(d, cf_1, file="_output/204_09_cf1.rds")
sessionInfo()
```
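As a closing footnote on the frequency-weighted smoother mentioned above: `lowess()` has no weights argument, so repeating each point with `rep()` as often as its bin frequency is a workaround that gives well-populated regions more leverage in the fit. A self-contained toy sketch (all numbers are made up for illustration, not taken from our data):

```r
# Toy illustration of frequency-weighting a smoother via rep():
# lowess() has no weights argument, so each x/y pair is repeated as
# often as its (made-up) bin frequency, giving frequent bins more
# leverage in the fit
x <- c(1, 2, 3, 4)          # grid values (hypothetical)
y <- c(0.2, 0.8, 0.3, 0.25) # PDP-like scores (hypothetical)
f <- c(10, 1, 10, 10)       # bin frequencies (hypothetical)
unweighted <- lowess(x, y)           # every point counts equally
weighted   <- lowess(rep(x, f),      # each point repeated f times
                     rep(y, f))
```

The same idea underlies the hidden chunk above that repeats `pd_l$yhat` and `pd_l$LEN_PORmPUM_LOG` by `pd_l$FREQ` before smoothing.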