Ling 105: session 04: bin. logistic regr. 1 (key)

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

22 Apr 2024 12-34-56

1 Introduction

Let’s load a data set for a few binary logistic regression modeling examples; the data are in _input/genitives.csv and you can find information about the variables/columns in _input/genitives.r.

rm(list=ls(all.names=TRUE))
library(car); library(effects); library(magrittr); library(multcomp)
source("_helpers/R2s.r"); source("_helpers/C.score.r"); source("_helpers/cohens.kappa.r")
summary(d <- read.delim(        # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE))      # change categorical variables into factors
      CASE      GENITIVE  SPEAKER       MODALITY      POR_LENGTH
 Min.   :   2   of:2720   nns:2666   spoken :1685   Min.   :  1.00
 1st Qu.:1006   s : 880   ns : 934   written:1915   1st Qu.:  8.00
 Median :2018                                       Median : 11.00
 Mean   :2012                                       Mean   : 14.58
 3rd Qu.:3017                                       3rd Qu.: 17.00
 Max.   :4040                                       Max.   :204.00
   PUM_LENGTH         POR_ANIMACY   POR_FINAL_SIB        POR_DEF
 Min.   :  2.00   animate   : 920   absent :2721   definite  :2349
 1st Qu.:  6.00   collective: 607   present: 879   indefinite:1251
 Median :  9.00   inanimate :1671
 Mean   : 10.35   locative  : 243
 3rd Qu.: 13.00   temporal  : 159
 Max.   :109.00                                                     

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.

2 Deviance & baseline(s)

Let’s already compute the baselines for what will be the response variable in all case studies, GENITIVE:

(baselines <- c(
   "baseline 1"=max(          # make baselines[1] the highest
      prop.table(             # proportion in the
         table(d$GENITIVE))), # frequency table of the response
   "baseline 2"=sum(             # make baselines[2] the sum of the
      prop.table(                # proportions in the
         table(d$GENITIVE))^2))) # frequency table of the response squared
baseline 1 baseline 2
 0.7555556  0.6306173 

As for deviance, in the case of lm, we defined deviance as the sum of squared residuals and we said we want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. What is deviance here in the case of a glm? To understand that, we first compute a null model and then, from that, the deviance:

summary(m_00 <- glm(GENITIVE ~ 1, family=binomial, data=d, na.action=na.exclude))

Call:
glm(formula = GENITIVE ~ 1, family = binomial, data = d, na.action = na.exclude)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.12847    0.03878   -29.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4004.3  on 3599  degrees of freedom
Residual deviance: 4004.3  on 3599  degrees of freedom
AIC: 4006.3

Number of Fisher Scoring iterations: 4
deviance(m_00)
[1] 4004.273
sum(residuals(m_00)^2)
[1] 4004.273

We see that the deviance of m_00 is still the sum of this model’s squared residuals and the model’s “null deviance” (and, if a model is a null model, also what is called “residual deviance”). But what are squared residuals here? We will discuss that in more detail after we have fitted our first ‘real model’ …

3 A binary predictor

Does the choice of a genitive construction vary as a function of the modality in which the construction was produced (spoken vs. written)?

3.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
table(d$GENITIVE)

  of    s
2720  880 
table(d$MODALITY)

 spoken written
   1685    1915 
# the predictor(s) w/ the response
(mod_x_gen <- table(d$MODALITY, d$GENITIVE))

            of    s
  spoken  1232  453
  written 1488  427

How could one approach this ‘descriptively’? We could say what the odds are for s-genitives in the spoken data …

453 / 1232 # odds for s-genitives in the spoken data
[1] 0.3676948

… and we could say what the odds are for s-genitives in the written data:

427 / 1488 # odds for s-genitives in the written data
[1] 0.2869624

But in some sense, odds are annoying because they have this very different value range for ‘I like s-genitives’ vs. ‘I don’t like s-genitives’, here shown for the spoken part:

1675 /   10 # I like s-genitives a lot
[1] 167.5
1000 /  685 # I like s-genitives somewhat
[1] 1.459854
 843 /  842 # I like s-genitives pretty much as much as of-genitives
[1] 1.001188
 685 / 1000 # I dislike s-genitives somewhat
[1] 0.685
  10 / 1675 # I don't like s-genitives at all
[1] 0.005970149

Let’s make this more symmetric by logging:

log(1675 /   10) # I like s-genitives a lot
[1] 5.120983
log(1000 /  685) # I like s-genitives somewhat
[1] 0.3783364
log( 843 /  842) # I like s-genitives pretty much as much as of-genitives
[1] 0.001186944
log( 685 / 1000) # I dislike s-genitives somewhat
[1] -0.3783364
log(  10 / 1675) # I don't like s-genitives at all
[1] -5.120983

Much better, so let’s do that for our real data:

mod_x_gen # reminder what the real data were

            of    s
  spoken  1232  453
  written 1488  427
log(453 / 1232) # LOGGED odds for s-genitives in the spoken data
[1] -1.000502
log(427 / 1488) # LOGGED odds for s-genitives in the written data
[1] -1.248404

Keep those numbers in mind. But most people would probably express this in probabilities:

addmargins(mod_x_gen) # reminder what the real data were

            of    s  Sum
  spoken  1232  453 1685
  written 1488  427 1915
  Sum     2720  880 3600
453 / 1685 # probability of s-genitives in the spoken data
[1] 0.2688427
427 / 1915 # probability of s-genitives in the written data
[1] 0.2229765

We’ll revisit this in a moment …

3.2 Modeling & numerical interpretation

How would we have dealt with this in Ling 104? First, we would have computed a chi-squared test like this, where the chi-squared value is computed involving differences of observed and expected values:

chisq.test(mod_x_gen, correct=FALSE)

    Pearson's Chi-squared test

data:  mod_x_gen
X-squared = 10.21, df = 1, p-value = 0.001397

Second, we might have computed the odds ratio and the logged odds ratio for the s-genitives:

(odds.ratio <-
   (mod_x_gen[2,2] / mod_x_gen[2,1]) /
   (mod_x_gen[1,2] / mod_x_gen[1,1]))
[1] 0.7804363
log(odds.ratio)
[1] -0.2479022

How do we do this now? With a binary logistic regression, which returns the G2-statistic, which is computed based on ratios of observed and expected values (not differences like chi-squared) and which also returns the (logged) odds ratio. Let’s fit our regression model,

summary(m_01 <- glm(      # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +         # GENITIVE ~ an overall intercept (1)
   MODALITY,              # & the predictor MODALITY
   family=binomial,       # resp = binary
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data

Call:
glm(formula = GENITIVE ~ 1 + MODALITY, family = binomial, data = d,
    na.action = na.exclude)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.00050    0.05495 -18.208  < 2e-16 ***
MODALITYwritten -0.24790    0.07767  -3.192  0.00141 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4004.3  on 3599  degrees of freedom
Residual deviance: 3994.1  on 3598  degrees of freedom
AIC: 3998.1

Number of Fisher Scoring iterations: 4

… compute the significance of the model,

pchisq(              # compute the area under the chi-squared curve
   q=4004.3-3994.1,  # for this chi-squared value: null - residual deviance
   df=3599-3598,     # at this df: null - residual df
   lower.tail=FALSE) # only using the right/upper tail/side
[1] 0.001404407

… the significance of the predictor, and

drop1(m_01,      # drop each droppable predictor at a time from m_01 &
   test="Chisq") # test its significance w/ a LRT
Single term deletions

Model:
GENITIVE ~ 1 + MODALITY
         Df Deviance    AIC    LRT Pr(>Chi)
<none>        3994.1 3998.1
MODALITY  1   4004.3 4006.3 10.194 0.001409 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

… the confidence intervals:

Confint(m_01) # confidence intervals
                  Estimate      2.5 %      97.5 %
(Intercept)     -1.0005020 -1.1091198 -0.89367488
MODALITYwritten -0.2479022 -0.4002591 -0.09571937

(Note how similar the chi-squared value and the G2-statistic and their p-values are.)

But what are the coefficients?

coef(m_01)
    (Intercept) MODALITYwritten
     -1.0005020      -0.2479022 

Recognize the intercept?

coef(m_01)[1]
(Intercept)
  -1.000502 
mod_x_gen # reminder what the real data were

            of    s
  spoken  1232  453
  written 1488  427
log(453 / 1232) # LOGGED odds for s-genitives in the spoken data
[1] -1.000502

Recognize what the MODALITY coefficient is?

"+"(                # add to
   log(453 / 1232), # coef(m_01)[1], the intercept
   coef(m_01)[2])   # coef(m_01)[2],m the coef for MODALITY
MODALITYwritten
      -1.248404 
log(427 / 1488) # LOGGED odds for s-genitives in the written data
[1] -1.248404

Let’s make ‘predictions’, which we add to the data: first, we compute the predicted probabilities:

d$PRED_PP_S <- predict( # make d$PRED_PP_S the result of predicting
   m_01,                 # from m_01
   type="response")      # predicted probabilities of "s"
d[c(1,3600),]
     CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1       2       of     nns   spoken         13          7  collective
3600 4040        s      ns  written          7          6     animate
     POR_FINAL_SIB  POR_DEF PRED_PP_S
1           absent definite 0.2688427
3600        absent definite 0.2229765

Note: if you don’t say type="response", predict gives you log odds:

predict(m_01)[c(1, 3600)]
        1      3600
-1.000502 -1.248404 

But how do we get the predicted probabilities (from the (logged) odds)?

"/"(                                   # divide
   exp(predict(m_01)[c(1, 3600)]),     #    the odds by
   (1+exp(predict(m_01)[c(1, 3600)]))) # 1+ the odds
        1      3600
0.2688427 0.2229765 
plogis(predict(m_01)[c(1, 3600)]) # the short version
        1      3600
0.2688427 0.2229765 

Then, we compute the categorical predictions:

d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"

Actually, the much shorter though much less intuitive version of generating d$PRED_CAT is this:

d$PRED_CAT <- factor(levels(d$GENITIVE)[1+(d$PRED_PP_S>=0.5)])

On top of that, let’s also add something else, namely a column called PRED_PP_OBS that contains, for each case, the predicted probability of the level that was observed in the data, i.e. what the model should have predicted. How does that work? Consider the head of d:

head(d)
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_FINAL_SIB  POR_DEF PRED_PP_S PRED_CAT
1        absent definite 0.2688427       of
2        absent definite 0.2688427       of
3       present definite 0.2688427       of
4        absent definite 0.2688427       of
5        absent definite 0.2688427       of
6        absent definite 0.2688427       of

In the first case, the model predicted s with a probability of 0.2688427 and, thus, of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is of so PRED_PP_OBS[1] should be 0.7311573. In the fifth case, the model again predicted s with a probability of 0.2688427 and, thus of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is s so PRED_PP_OBS[2] should be 0.2688427. How do we do that efficiently for all cases (and in the spirit of Section 3.5.2, efficiently means ‘without a loop’)?

d$PRED_PP_OBS <- ifelse( # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
     d$PRED_PP_S,        # take its predicted probability
   1-d$PRED_PP_S)        # otherwise take 1 minus its predicted probability
head(d)
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_FINAL_SIB  POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1        absent definite 0.2688427       of   0.7311573
2        absent definite 0.2688427       of   0.7311573
3       present definite 0.2688427       of   0.7311573
4        absent definite 0.2688427       of   0.7311573
5        absent definite 0.2688427       of   0.2688427
6        absent definite 0.2688427       of   0.2688427

Here, too, a shorter but much less intuitive version is available; if you have time to waste, try and find it, but for some weird reason it’s a brain teaser, consider yourself warned.

We’ll get to why this is useful in a moment … But let’s now evaluate the predictive capacity of the model (I’ll explain McFadden’s R2 below):

(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,  # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns
    PREDS
OBS    of
  of 2720
  s   880
# evaluating the confusion matrix here in the usual way doesn't make much sense,
# which is why we discuss that in the next section
c(R2s(m_01),
  "C"=C.score(d$GENITIVE, d$PRED_PP_S))
  McFadden R-squared Nagelkerke R-squared                    C
         0.002545694          0.004212717          0.530915775 

3.3 Visual interpretation

Let’s visualize the nature of the effect of MODALITY based on the predictions:

(mo_d <- data.frame( # make mo_d a data frame of
   mo <- effect(     # the effect
      "MODALITY",    # of MODALITY
      m_01)))        # in m_01
  MODALITY       fit          se     lower     upper
1   spoken 0.2688427 0.010800758 0.2482073 0.2905308
2  written 0.2229765 0.009511784 0.2048903 0.2421729
plot(mo,            # plot the effect of MODALITY
   type="response", # but plot predicted probabilities
   ylim=c(0, 1),    # ylim: the whole range of possible probabilities
   grid=TRUE)       # & w/ a grid

3.4 Excursus: deviance derived from logloss

We saw above that the deviance of a glm is again the sum of squared residuals, which means we again want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. But what is deviance here in the case of a glm exactly, how is it computed? To understand, let’s look at the predictions of m_01 again:

head(d[,c(2,4,10:12)])
  GENITIVE MODALITY PRED_PP_S PRED_CAT PRED_PP_OBS
1       of   spoken 0.2688427       of   0.7311573
2       of   spoken 0.2688427       of   0.7311573
3       of   spoken 0.2688427       of   0.7311573
4       of   spoken 0.2688427       of   0.7311573
5        s   spoken 0.2688427       of   0.2688427
6        s   spoken 0.2688427       of   0.2688427

What would the best model ever have returned for these 6 cases? Something like this, where the model

  • predicts all instances of s correctly by returning a predicted probability of s of essentially 1;
  • predicts all instances of of correctly by returning a predicted probability of s of essentially 0:
  GENITIVE PRED_PP_S PRED_CAT PRED_PP_OBS
1       of   0.00001       of     0.99999
2       of   0.00001       of     0.99999
3       of   0.00001       of     0.99999
4       of   0.00001       of     0.99999
5        s   0.99999        s     0.99999
6        s   0.99999        s     0.99999

That means we could use 1-PRED_PP_OBS to quantify how much off from a perfect prediction the model is, but there is an alternative: We could taking the negative logs of PRED_PP_OBS because then

  • if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 1 – i.e. the model is good – then we end up -logging something that’s very close to 1, which makes the negative logs really close to 0, meaning ‘we’re not far off’:
-log(0.8)
[1] 0.2231436
-log(0.9)
[1] 0.1053605
-log(0.95)
[1] 0.05129329
  • if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 0 – i.e. the model is bad – then we end up -logging something that’s very close to 0, which makes the negative logs grow and grow, meaning ‘we’re quite far off’:
-log(0.2)
[1] 1.609438
-log(0.1)
[1] 2.302585
-log(0.05)
[1] 2.995732

I am calling these values contributions to logloss (because the mean of all those values is an evaluation metric called logloss). But what does this have to do with deviance? This:

deviance(m_01) # same as
[1] 3994.079
# the sum of all contributions to logloss times 2
sum(-log(d$PRED_PP_OBS)) * 2
[1] 3994.079

That means, the deviance of such a glm, what the summary calls “residual deviance”, expresses how far all the predicted probabilities of the events that should have been predicted are from their theoretical ideals of (very near to) 0 and (very near to) 1! That actually means we can also again express how much the deviance of the null model, the null deviance, is reduced by the predictors in percent:

  (m_00$deviance     -m_01$deviance) / m_00$deviance # same as
# (m_01$null.deviance-m_01$deviance) / m_01$null.deviance
[1] 0.002545694

This is McFadden’s R2-value – not the one that is mostly used, which is Nagelkerke’s R2, but still very useful because of its parallelism to the R2 of linear models.

3.5 Write-up

To determine whether the choice of a genitive varies as a function of the modality (spoken vs. written), a generalized linear model was fit with GENITIVE as the response variable and MODALITY as a predictor. The model was highly significant (LR=10.19, df=1, p<0.002) but explains very little of the response variable (Nagelkerke R2=0.004, McFadden’s R2=0.003, C=0.53) and has no predictive power: the model only ever predicts of-genitives. The coefficients indicate that s-genitives are more likely in spoken data than in written data, as shown in the summary table shown here. [Show a plot]

Estimate 95%-CI se z p2-tailed
Intercept (MODALITY=spoken) -1 [-1.11, -0.89] 0.05 -18.21 <0.001
MODALITYspokenwritten -0.25 [-0.4, -0.1] 0.08 -3.19 <0.002
# housekeeping: remove the predictions
str(d <- d[,1:9])
'data.frame':   3600 obs. of  9 variables:
 $ CASE         : int  2 3 4 5 6 7 8 9 10 11 ...
 $ GENITIVE     : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
 $ SPEAKER      : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
 $ MODALITY     : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
 $ POR_LENGTH   : int  13 22 11 26 8 7 6 5 5 5 ...
 $ PUM_LENGTH   : int  7 7 8 4 4 3 36 3 5 5 ...
 $ POR_ANIMACY  : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
 $ POR_FINAL_SIB: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...
 $ POR_DEF      : Factor w/ 2 levels "definite","indefinite": 1 1 1 1 1 1 1 1 1 1 ...

4 A categorical predictor

Does the choice of a genitive construction vary as a function on the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal)?

4.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
table(d$GENITIVE)

  of    s
2720  880 
table(d$POR_ANIMACY)

   animate collective  inanimate   locative   temporal
       920        607       1671        243        159 
# the predictor(s) w/ the response
table(d$POR_ANIMACY, d$GENITIVE)

               of    s
  animate     370  550
  collective  408  199
  inanimate  1638   33
  locative    199   44
  temporal    105   54

4.2 Modeling & numerical interpretation

How would we have dealt with this in Ling 104? Again with a chi-squared test. This time, we immediately fit our regression model:

summary(m_01 <- glm(      # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +         # GENITIVE ~ an overall intercept (1)
   POR_ANIMACY,           # & the predictor POR_ANIMACY
   family=binomial,       # resp = binary
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data

Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY, family = binomial,
    data = d, na.action = na.exclude)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)            0.39642    0.06724   5.896 3.73e-09 ***
POR_ANIMACYcollective -1.11438    0.10953 -10.174  < 2e-16 ***
POR_ANIMACYinanimate  -4.30114    0.18823 -22.851  < 2e-16 ***
POR_ANIMACYlocative   -1.90553    0.17965 -10.607  < 2e-16 ***
POR_ANIMACYtemporal   -1.06139    0.18045  -5.882 4.06e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4004.3  on 3599  degrees of freedom
Residual deviance: 2766.0  on 3595  degrees of freedom
AIC: 2776

Number of Fisher Scoring iterations: 6

… the significance of the model,

pchisq(              # compute the area under the chi-squared curve
   q=4004.3-2766,    # for this chi-squared value: null - residual deviance
   df=3599-3595,     # at this df: null - residual df
   lower.tail=FALSE) # only using the right/upper tail/side
[1] 7.926259e-267

… the significance of the predictor, and

drop1(m_01,      # drop each droppable predictor at a time from m_01 &
   test="Chisq") # test its significance w/ a LRT
Single term deletions

Model:
GENITIVE ~ 1 + POR_ANIMACY
            Df Deviance    AIC    LRT  Pr(>Chi)
<none>           2766.0 2776.0
POR_ANIMACY  4   4004.3 4006.3 1238.3 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

… the confidence intervals:

Confint(m_01) # confidence intervals
                        Estimate      2.5 %     97.5 %
(Intercept)            0.3964153  0.2651511  0.5288144
POR_ANIMACYcollective -1.1143776 -1.3303631 -0.9008787
POR_ANIMACYinanimate  -4.3011390 -4.6879154 -3.9477968
POR_ANIMACYlocative   -1.9055305 -2.2682451 -1.5625579
POR_ANIMACYtemporal   -1.0613916 -1.4207695 -0.7121414

Let’s make (numerical and categorical) ‘predictions’, add them to the data, and evaluate them:

d$PRED_PP_S <- predict( # make d$PRED_PP_S the result of predicting
   m_01,                 # from m_01
   type="response")      # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns
    PREDS
OBS    of    s
  of 2350  370
  s   330  550
c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
       0.5978261        0.6250000        0.8768657        0.8639706
  Acc. (overall)
       0.8055556 
c(R2s(m_01),
  "Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$GENITIVE, d$PRED_PP_S))
  McFadden R-squared Nagelkerke R-squared        Cohen's kappa
           0.3092390            0.4336234            0.4815668
                   C
           0.8472389 

Again, we also add our column PRED_PP_OBS with, for each case, the predicted probability of the level that was observed in the data:

d$PRED_PP_OBS <- ifelse( # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",      # if the obs genitive is the 2nd level of the response
     d$PRED_PP_S,        # take its predicted probability
   1-d$PRED_PP_S)        # otherwise take 1 minus its predicted probability
head(d)
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_FINAL_SIB  POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1        absent definite 0.3278418       of   0.6721582
2        absent definite 0.5978261        s   0.4021739
3       present definite 0.5978261        s   0.4021739
4        absent definite 0.3278418       of   0.6721582
5        absent definite 0.5978261        s   0.5978261
6        absent definite 0.5978261        s   0.5978261

4.3 Visual interpretation

Let’s visualize the nature of the effect of POR_ANIMACY based on the predictions:

(an_d <- data.frame( # make an_d a data frame of
   an <- effect(      # the effect
      "POR_ANIMACY",  # of POR_ANIMACY
      m_01)))         # in m_01
  POR_ANIMACY        fit          se      lower      upper
1     animate 0.59782609 0.016165922 0.56577463 0.62906282
2  collective 0.32784185 0.019053448 0.29164055 0.36621363
3   inanimate 0.01974865 0.003403457 0.01407325 0.02764863
4    locative 0.18106996 0.024702646 0.13756935 0.23458435
5    temporal 0.33962264 0.037557428 0.27028269 0.41659580
plot(an,            # plot the effect of POR_ANIMACY
   type="response", # but plot predicted probabilities
   ylim=c(0, 1),    # ylim: the whole range of possible probabilities
   grid=TRUE)       # & w/ a grid

4.4 Excursus: deviance derived from logloss

The way the contributions to logloss and the deviance are computed do not change just because we have (a) different predictor(s) now: The deviance is still the sum of all contributions to logloss times 2:

deviance(m_01)
[1] 2765.995
sum(-log(d$PRED_PP_OBS)) * 2
[1] 2765.995
m_01$deviance
[1] 2765.995

4.5 Excursus: model comparison

Given a lot of previous literature on the genitive alternation, you might actually have hypothesized that the only thing one needs is a distinction between animate (and maybe collective) and all inanimate possessors. Let’s use model comparison to check this. First, we create a version of POR_ANIMACY that conflates levels as desired:

d$POR_ANIMACY_confl <- d$POR_ANIMACY # create a copy of POR_ANIMACY
levels(d$POR_ANIMACY_confl) <-       # overwrite the copy's levels such that
   c("anim/coll", "anim/coll",       # "animate" and "collective" become "anim/coll"
      "inanim/loc/temp", "inanim/loc/temp", "inanim/loc/temp") # all others become "inanim/loc/temp"
table(POR_ANIMACY=d$POR_ANIMACY, POR_ANIMACY_confl=d$POR_ANIMACY_confl) # check you did it right
            POR_ANIMACY_confl
POR_ANIMACY  anim/coll inanim/loc/temp
  animate          920               0
  collective       607               0
  inanimate          0            1671
  locative           0             243
  temporal           0             159
head(d <- d[,c(1:7,13,8:12)])
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_ANIMACY_confl POR_FINAL_SIB  POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1         anim/coll        absent definite 0.3278418       of   0.6721582
2         anim/coll        absent definite 0.5978261        s   0.4021739
3         anim/coll       present definite 0.5978261        s   0.4021739
4         anim/coll        absent definite 0.3278418       of   0.6721582
5         anim/coll        absent definite 0.5978261        s   0.5978261
6         anim/coll        absent definite 0.5978261        s   0.5978261

We already have the model with POR_ANIMACY as the only predictor so we only need to fit a new model with POR_ANIMACY_confl as the only predictor:

summary(m_01_animconfl <- glm( # make/summarize the gen. linear model m_01_animconfl:
   GENITIVE ~ 1 + POR_ANIMACY_confl,    # ORDER ~ an overall intercept (1) & POR_ANIMACY_confl
   family=binomial, data=d, na.action=na.exclude)) # the other arguments

Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY_confl, family = binomial,
    data = d, na.action = na.exclude)

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)
(Intercept)                      -0.03799    0.05119  -0.742    0.458
POR_ANIMACY_conflinanim/loc/temp -2.65829    0.10377 -25.616   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4004.3  on 3599  degrees of freedom
Residual deviance: 3093.4  on 3598  degrees of freedom
AIC: 3097.4

Number of Fisher Scoring iterations: 5

… and then we do a comparison with anova just like above, the only difference being the argument to test (because we’re now fitting a generalized linear model).

anova(m_01,        # compare the model w/ the 5-level predictor POR_ANIMACY
   m_01_animconfl, # & the model w/ the 2-level predictor POR_ANIMACY_confl
   test="Chisq")   # using a LRT
Analysis of Deviance Table

Model 1: GENITIVE ~ 1 + POR_ANIMACY
Model 2: GENITIVE ~ 1 + POR_ANIMACY_confl
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1      3595     2766.0
2      3598     3093.4 -3  -327.39 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the two models is hugely significant, we cannot reduce POR_ANIMACY to the only two levels we were considering.

4.6 Write-up

To determine whether the choice of a genitive varies as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal), a generalized linear model was fit with GENITIVE as the response variable and POR_ANIMACY as a predictor. The model was highly significant (LR=1238.3, df=4, p<0.001) and explains a decent amount of the response variable (Nagelkerke R2=0.434, McFadden’s R2=0.309, Cohen’s κ=0.482, C=0.85) with a good amount of predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.

Estimate 95%-CI se z p2-tailed
Intercept (POR_ANIMACY=animate) 0.40 [0.27, 0.53] 0.07 5.9 <0.001
POR_ANIMACYanimatecollective -1.11 [-1.33, -0.9] 0.11 -10.17 <0.002
POR_ANIMACYanimateinanimate -4.30 [-4.69, -3.95] 0.19 -22.85 <0.002
POR_ANIMACYanimatelocative -1.91 [-2.27, -1.56] 0.18 -10.61 <0.002
POR_ANIMACYanimatetemporal -1.06 [-1.42, -0.71] 0.18 -5.88 <0.002

The model predicts s-genitives for animate possessors but of-genitives for collective/temporal possessors, then locative possessors, and then nearly exclusively for inanimate possessors. [Show a plot] A post-hoc comparison of the above model with one in which POS_ANIMACY was reduced to 2 levels (animate/collective vs. inanimate/locative/temporal) showed that this reduced predictor is highly significantly worse (LR=327.39, p<0.0001).

# housekeeping: remove the predictions
str(d <- d[,1:9])
'data.frame':   3600 obs. of  9 variables:
 $ CASE             : int  2 3 4 5 6 7 8 9 10 11 ...
 $ GENITIVE         : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
 $ SPEAKER          : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
 $ MODALITY         : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
 $ POR_LENGTH       : int  13 22 11 26 8 7 6 5 5 5 ...
 $ PUM_LENGTH       : int  7 7 8 4 4 3 36 3 5 5 ...
 $ POR_ANIMACY      : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
 $ POR_ANIMACY_confl: Factor w/ 2 levels "anim/coll","inanim/loc/temp": 1 1 1 1 1 1 1 1 1 1 ...
 $ POR_FINAL_SIB    : Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...

5 A numeric predictor

Does the choice of a genitive construction vary as a function on the length of the possessor (in characters)?

5.1 Exploration & preparation

# the predictor(s)/response on its/their own
table(d$GENITIVE)

  of    s
2720  880 
hist(d$POR_LENGTH); hist(d$POR_LENGTH, breaks="FD")

# the predictor(s) w/ the response
spineplot(d$GENITIVE ~ d$POR_LENGTH)

The predictor POR_LENGTH looks like it’s going to cause problems if we use it like this, given its very skewed distribution, but we’ll try for now and fit our regression model.

5.2 Modeling & numerical interpretation

Here’s our initial model, its significance, its predictor’s significance, and its confidence intervals:

summary(m_01 <- glm(      # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +         # GENITIVE ~ an overall intercept (1)
   POR_LENGTH,            # & the predictor POR_LENGTH
   family=binomial,       # resp = binary
   data=d,                # those vars = in x
   na.action=na.exclude)) # skip cases with NA/missing data

Call:
glm(formula = GENITIVE ~ 1 + POR_LENGTH, family = binomial, data = d,
    na.action = na.exclude)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.337551   0.090941   3.712 0.000206 ***
POR_LENGTH  -0.126693   0.008189 -15.471  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4004.3  on 3599  degrees of freedom
Residual deviance: 3598.6  on 3598  degrees of freedom
AIC: 3602.6

Number of Fisher Scoring iterations: 6
pchisq(              # compute the area under the chi-squared curve
   q=4004.3-3598.6,  # for this chi-squared value: null - residual deviance
   df=3599-3598,     # at this df: null - residual df
   lower.tail=FALSE) # only using the right/upper tail/side
[1] 3.163282e-90
drop1(m_01,      # drop each droppable predictor at a time from m_01 &
   test="Chisq") # test its significance w/ a LRT
Single term deletions

Model:
GENITIVE ~ 1 + POR_LENGTH
           Df Deviance    AIC    LRT  Pr(>Chi)
<none>          3598.6 3602.6
POR_LENGTH  1   4004.3 4006.3 405.69 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confint(m_01) # confidence intervals
              Estimate      2.5 %     97.5 %
(Intercept)  0.3375506  0.1607386  0.5172721
POR_LENGTH  -0.1266934 -0.1431028 -0.1110008

Let’s make ‘predictions’ and evaluate them:

d$PRED_PP_S <- predict( # make d$PRED_PP_S the result of predicting
   m_01,                 # from m_01
   type="response")      # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,     # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns
    PREDS
OBS    of    s
  of 2712    8
  s   868   12
c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
      0.60000000       0.01363636       0.75754190       0.99705882
  Acc. (overall)
      0.75666667 
c(R2s(m_01),
  "Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$GENITIVE, d$PRED_PP_S))
  McFadden R-squared Nagelkerke R-squared        Cohen's kappa
          0.10131504           0.15878289           0.01597604
                   C
          0.70785741 

Again, we also add our column PRED_PP_OBS with, for each case, the predicted probability of the level that was observed in the data:

d$PRED_PP_OBS <- ifelse( # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",      # if the obs genitive is the 2nd level of the response
     d$PRED_PP_S,        # take its predicted probability
   1-d$PRED_PP_S)        # otherwise take 1 minus its predicted probability
head(d)
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_ANIMACY_confl POR_FINAL_SIB  PRED_PP_S PRED_CAT PRED_PP_OBS
1         anim/coll        absent 0.21257664       of   0.7874234
2         anim/coll        absent 0.07946020       of   0.9205398
3         anim/coll       present 0.25805992       of   0.7419401
4         anim/coll        absent 0.04943126       of   0.9505687
5         anim/coll        absent 0.33715543       of   0.3371554
6         anim/coll        absent 0.36602611       of   0.3660261

5.3 Visual interpretation

Let’s visualize the nature of the effect of POR_LENGTH based on the predictions:

(le_d <- data.frame( # make le_d a data frame of
   le <- effect(     # the effect
      "POR_LENGTH",  # of POR_LENGTH
      m_01)))        # in m_01
  POR_LENGTH          fit           se        lower        upper
1          1 5.525199e-01 2.069478e-02 5.116986e-01 5.926452e-01
2         52 1.925762e-03 6.668753e-04 9.765240e-04 3.794212e-03
3        100 4.409307e-06 3.257399e-06 1.036429e-06 1.875843e-05
4        150 7.820963e-09 8.977002e-09 8.245941e-10 7.417887e-08
5        200 1.387229e-11 2.160024e-11 6.557839e-13 2.934510e-10
plot(le,            # plot the effect of POR_LENGTH
   type="response", # but plot predicted probabilities
   ylim=c(0, 1),    # ylim: the whole range of possible probabilities
   grid=TRUE)       # & w/ a grid

5.4 Excursus: what if we log-transform POR_LENGTH?

Ok, those results were pretty bad – let’s do this again because maybe it’s really smarter to transform the possessor lengths and, given the shape of this variable’s histogram and because no length is 0, we try the log transformation:

# the predictor(s)/response on its/their own
d$POR_LENGTH_LOG <- log2(d$POR_LENGTH)
table(d$GENITIVE)

  of    s
2720  880 
hist(d$POR_LENGTH_LOG); hist(d$POR_LENGTH_LOG, breaks="FD")

# the predictor(s) w/ the response
spineplot(d$GENITIVE ~ d$POR_LENGTH_LOG)

Let’s fit our 2nd regression model and all its other stats:

summary(m_02 <- glm(      # make/summarize the gen. linear model m_02:
   GENITIVE ~ 1 +         # GENITIVE ~ an overall intercept (1)
   POR_LENGTH_LOG,        # & the predictor POR_LENGTH_LOG
   family=binomial,       # resp = binary
   data=d,                # those vars = in x
   na.action=na.exclude)) # skip cases with NA/missing data

Call:
glm(formula = GENITIVE ~ 1 + POR_LENGTH_LOG, family = binomial,
    data = d, na.action = na.exclude)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.90960    0.17609   10.85   <2e-16 ***
POR_LENGTH_LOG -0.90493    0.05325  -16.99   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4004.3  on 3599  degrees of freedom
Residual deviance: 3643.1  on 3598  degrees of freedom
AIC: 3647.1

Number of Fisher Scoring iterations: 4
pchisq(              # compute the area under the chi-squared curve
   q=4004.3-3643.1,  # for this chi-squared value: null - residual deviance
   df=3599-3598,     # at this df: null - residual df
   lower.tail=FALSE) # only using the right/upper tail/side
[1] 1.542725e-80
drop1(m_02,      # drop each droppable predictor at a time from m_02 &
   test="Chisq") # test its significance w/ a LRT
Single term deletions

Model:
GENITIVE ~ 1 + POR_LENGTH_LOG
               Df Deviance    AIC    LRT  Pr(>Chi)
<none>              3643.1 3647.1
POR_LENGTH_LOG  1   4004.3 4006.3 361.18 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confint(m_02) # confidence intervals
                 Estimate     2.5 %     97.5 %
(Intercept)     1.9096029  1.567704  2.2581724
POR_LENGTH_LOG -0.9049291 -1.010777 -0.8019605

Let’s make ‘predictions’ and evaluate them:

d$PRED_PP_S <- predict( # make d$PRED_PP_S the result of predicting
   m_02,                 # from m_02
   type="response")      # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted orders in the columns
    PREDS
OBS    of    s
  of 2614  106
  s   809   71
c( # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
      0.40112994       0.08068182       0.76365761       0.96102941
  Acc. (overall)
      0.74583333 
c(R2s(m_02),
  "Cohen's kappa"=cohens.kappa(c_m)[[1]],
  "C"=C.score(d$GENITIVE, d$PRED_PP_S))
  McFadden R-squared Nagelkerke R-squared        Cohen's kappa
          0.09019761           0.14222054           0.05715463
                   C
          0.70785741 

Great … all the work for nothing much. Again, we also add our column PRED_PP_OBS with, for each case, the predicted probability of the level that was observed in the data:

d$PRED_PP_OBS <- ifelse( # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
     d$PRED_PP_S,        # take its predicted probability
   1-d$PRED_PP_S)        # otherwise take 1 minus its predicted probability
head(d)
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_ANIMACY_confl POR_FINAL_SIB  PRED_PP_S PRED_CAT PRED_PP_OBS
1         anim/coll        absent 0.19169521       of   0.8083048
2         anim/coll        absent 0.10660805       of   0.8933919
3         anim/coll       present 0.22777155       of   0.7722285
4         anim/coll        absent 0.08754694       of   0.9124531
5         anim/coll        absent 0.30891765       of   0.3089176
6         anim/coll        absent 0.34731689       of   0.3473169
  POR_LENGTH_LOG
1       3.700440
2       4.459432
3       3.459432
4       4.700440
5       3.000000
6       2.807355

Let’s visualize the nature of the effect of POR_LENGTH based on the predictions:

(le_d <- data.frame(    # make le_d a data frame of
   le <- effect(        # the effect
      "POR_LENGTH_LOG", # of POR_LENGTH_LOG
      m_02)))           # in m_02
  POR_LENGTH_LOG         fit          se       lower       upper
1              0 0.870974529 0.019788298 0.826995055 0.905057520
2              2 0.524915543 0.019097226 0.487415805 0.562136377
3              4 0.153149017 0.007572996 0.138887878 0.168587815
4              6 0.028749293 0.004293408 0.021429263 0.038471473
5              8 0.004821536 0.001237943 0.002913467 0.007969237
plot(le,            # plot the effect of POR_LENGTH_LOG
   type="response", # but plot predicted probabilities
   ylim=c(0, 1),    # ylim: the whole range of possible probabilities
   grid=TRUE)       # & w/ a grid

5.5 Excursus: deviance derived from logloss

The way the contributions to logloss and the deviance are computed do not change just because we have (a) different predictor(s) now: The deviance is still the sum of all contributions to logloss times 2:

deviance(m_02)
[1] 3643.097
sum(-log(d$PRED_PP_OBS)) * 2
[1] 3643.097
m_02$deviance
[1] 3643.097

5.6 Write-up

To determine whether the choice of a genitive varies as a function of the length of the possessor, a generalized linear model was fit with GENITIVE as the response variable and POR_LENGTH as a predictor; however, the initial model resulted in extremely poor fit probably in part because of the very long right tail of the predictor, which was therefore logged to the base of 2. A new model with this predictor was highly significant (LR=361.18, df=1, p<0.001) but explains only a small amount of the response variable (Nagelkerke R2=0.142, McFadden’s R2=0.101, Cohen’s κ=0.057, C=0.7088) with a fairly bad amount of predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.

Estimate 95%-CI se z p2-tailed
Intercept (POR_LENGTH=0) 1.91 [0.27, 0.53] 0.18 10.85 <0.001
POR_LENGTH0→1 -0.9 [-1.33, -0.9] 0.05 -16.99 <0.001

The model indicates a negative correlation between s-genitives and possessor length: the longer the possessor, the less likely s-genitives become and the more likely of-genitives become. [Show a plot]

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
 [1] multcomp_1.4-25 TH.data_1.1-2   MASS_7.3-60.0.1 survival_3.5-8
 [5] mvtnorm_1.2-4   effects_4.2-2   car_3.1-2       carData_3.0-5
 [9] STGmisc_1.0     Rcpp_1.0.12     magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] Matrix_1.6-5      jsonlite_1.8.8    survey_4.4-2      splines_4.3.3
 [5] boot_1.3-30       yaml_2.3.8        fastmap_1.1.1     lattice_0.22-6
 [9] knitr_1.46        htmlwidgets_1.6.4 nloptr_2.0.3      insight_0.19.10
[13] nnet_7.3-19       minqa_1.2.6       DBI_1.2.2         rlang_1.1.3
[17] xfun_0.43         estimability_1.5  cli_3.6.2         digest_0.6.35
[21] grid_4.3.3        rstudioapi_0.16.0 sandwich_3.1-0    lme4_1.1-35.3
[25] nlme_3.1-164      evaluate_0.23     codetools_0.2-20  zoo_1.8-12
[29] mitools_2.4       abind_1.4-5       colorspace_2.1-0  rmarkdown_2.26
[33] tools_4.3.3       htmltools_0.5.8.1